Foreword



The papers contained in this volume were presented at the 11th Conference on 
String Processing and Information Retrieval (SPIRE), held Oct. 5-8, 2004 at 
the Department of Information Engineering of the University of Padova, Italy. 
They were selected from 123 papers submitted in response to the call for papers. 
In addition, there were invited lectures by C.J. van Rijsbergen (University of 
Glasgow, UK) and Setsuo Arikawa (Kyushu University, Japan). In view of the 
large number of good-quality submissions, some were accepted this year also as 
short abstracts. These also appear in the proceedings. 

Papers solicited for SPIRE 2004 were meant to constitute original contribu- 
tions to areas such as string pattern searching, matching and discovery; data 
compression; text and data mining; machine learning; tasks, methods, algo- 
rithms, media, and evaluation in information retrieval; digital libraries; and ap- 
plications to and interactions with domains such as genome analysis, speech and 
natural language processing, Web links and communities, and multilingual data. 

SPIRE has its origins in the South American Workshop on String Process- 
ing which was first held in 1993. Starting in 1998, the focus of the symposium 
was broadened to include the area of information retrieval due to the common 
emphasis on information processing. The first 10 meetings were held in Belo Hor- 
izonte (Brazil, 1993), Valparaiso (Chile, 1995), Recife (Brazil, 1996), Valparaiso 
(Chile, 1997), Santa Cruz (Bolivia, 1998), Cancun (Mexico, 1999), A Coruna 
(Spain, 2000), Laguna San Rafael (Chile, 2001), Lisbon (Portugal, 2002), and 
Manaus (Brazil, 2003). 

SPIRE 2004 was held as part of Dialogues 2004, a concerted series of confer- 
ences and satellite meetings fostering exchange and integration in the modeling, 
design and implementation of advanced tools for the representation, encoding, 
storage, search, retrieval and discovery of information and knowledge. In Dia- 
logues 2004, the companion conferences for SPIRE were the 15th International 
Conference on Algorithmic Learning Theory and the 7th International Confer- 
ence on Discovery Science. 

The Program Committee consisted of: Rakesh Agrawal, IBM Almaden, USA;
Maristella Agosti, Univ. of Padova, Italy; Amihood Amir, Bar-Ilan Univ.,
Israel and Georgia Tech, USA; Alberto Apostolico, Univ. of Padova, Italy and
Purdue Univ., USA (Chair); Ricardo Baeza-Yates, Univ. of Chile, Chile; Krishna
Bharat, Google Inc., USA; Andrei Broder, IBM T.J. Watson Research Center,
Hawthorne, USA; Maxime Crochemore, Univ. of Marne la Vallee, France;
Bruce Croft, Univ. of Massachusetts Amherst, USA; Pablo de la Fuente, Univ. of
Valladolid, Spain; Edleno S. de Moura, Univ. Federal of Amazonas, Brazil;
Edward Fox, Virginia Tech., USA; Norbert Fuhr, Univ. of Duisburg, Germany;
Raffaele Giancarlo, Univ. of Palermo, Italy; Roberto Grossi, Univ. of Pisa,
Italy; Costas Iliopoulos, King’s College London, UK; Gad M. Landau, Univ. of
Haifa, Israel; Joao Meidanis, Univ. of Campinas, Brazil; Massimo Melucci, Univ.
Abstract. Real Scaled Matching refers to the problem of finding all
locations in the text where the pattern, proportionally enlarged according
to an arbitrary real-sized scale, appears. Real scaled matching is an
important problem that was originally inspired by Computer Vision.

In this paper, we present a new, more precise and realistic, definition
for one dimensional real scaled matching, and an efficient algorithm for
solving this problem. For a text of length n and a pattern of length m,
the algorithm runs in time $O(n \log m + \sqrt{n}\, m^{3/2} \sqrt{\log m})$.



1 Introduction 

The original classical string matching problem [10, 13] was motivated by text 
searching. Wide advances in technology, e.g. computer vision, multimedia li- 
braries, and web searches in heterogeneous data, have given rise to much study 
in the field of pattern matching. 

Landau and Vishkin [15] examined issues arising from the digitization pro- 
cess. Once the image is digitized, one wants to search it for various data. A whole 
body of literature examines the problem of seeking an object in an image. 

In reality one seldom expects to find an exact match of the object being 
sought, henceforth referred to as the pattern. Rather, it is interesting to find all 
text locations that “approximately” match the pattern. The types of differences 
that make up these “approximations” are: 

1. Local Errors - introduced by differences in the digitization process, noise, 
and occlusion (the pattern partly obscured by another object). 

2. Scale - size difference between the image in the pattern and the text. 

3. Rotation - angle differences between the pattern and text images. 

* Partially supported by ISF grant 282/01. Part of this work was done when the 
author was at Georgia Tech, College of Computing and supported by NSF grant 
CCR-01-04494. 
Some early attempts to handle local errors were made in [14]. These results
were improved in [8]. The algorithms in [8] heavily depend on the fact that the
pattern is a rectangle. In reality this is hardly ever the case. In [6], Amir and
Farach show how to deal with local errors in non-rectangular patterns.

The rotation problem is to find all rotated occurrences of a pattern in an
image. Fredriksson and Ukkonen [12] made the first step by giving a reasonable
definition of rotation in discrete images and introducing a filter for seeking
a rotated pattern. Amir et al. [2] presented an $O(n^2 m^5)$ time algorithm.
This was improved to $O(n^2 m^3)$ in [7].

For scaling it was shown [9, 5] that all occurrences of a given rectangular 
pattern in a text can be found in all discrete scales in linear time. By discrete 
scales we mean natural numbers, i.e. the pattern scaled to sizes 1, 2, 3, . . .. 

The first result handling real scales was given in [3]. In this paper, a linear 
time algorithm was given for one-dimensional real scaled matching. In [4], the 
problem of two-dimensional real scaled matching was defined, and an efficient 
algorithm was presented for this problem. 

The definition of one-dimensional scaling in [3] has the following drawback: 
For a pattern P of length m and a scale r, the pattern P scaled by r can have 
a length which is far from mr. In this paper, we give a more natural definition 
for scaling, which has the property that the length of P scaled by r is mr 
rounded to the nearest integer. This definition is derived from the definition of 
two-dimensional scaling which was given in [4]. We give an efficient algorithm
for the scaled matching problem under the new definition of scaling: For a text
T of length n and a pattern P of length m, the algorithm finds in T all
occurrences of P scaled to any real value in time
$O(n \log m + \sqrt{n}\, m^{3/2} \sqrt{\log m})$.

Roadmap: In section 2 we give the necessary preliminaries and definitions of the 
problem. In section 3 we present a simple algorithm that straightforwardly finds 
the scaled matches of the pattern in the text. In section 4 we present our main 
result, namely, the efficient algorithm for real scaled matching. 

2 Scaled Matching Definition 

Let T and P be two strings over some finite alphabet $\Sigma$. Let n and m be
the lengths of T and P, respectively.

Definition 1. A pixel is an interval $(i-1, i]$ on the real line $\mathbb{R}$,
where i is an integer. The center of a pixel $(i-1, i]$ is its geometric center
point, namely the point $i - 0.5$.

Definition 2. Let $r \in \mathbb{R}$, $r \ge 1$. The r-ary pixel array for P
consists of m intervals of length r, which are called r-intervals. The i-th
r-interval is $((i-1)r, ir]$. Each interval is identified with a value from
$\Sigma$: The i-th interval is identified with the i-th letter of P. For each
pixel center that is inside some r-interval, we assign the letter that
corresponds to that interval. The string obtained by concatenating all the
letters assigned to the pixel centers from left to right is denoted by $P^r$,
and called P scaled to r.

The scaled matching problem is to find all the locations in the text T in which
there is an occurrence of P scaled to some $r \ge 1$.

Example 1. Let P = aabccc. Then $P^{3/2}$ = aaabbcccc. Let T = aabdaaaabbcccccb.
There is an occurrence of P scaled to $\frac{3}{2}$ at text location 6.
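Example 1 can be reproduced with a short script. The following is only a minimal sketch of Definition 2, not part of the algorithm of this paper: the function name `scale` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here to avoid floating-point ties at pixel centers.

```python
from fractions import Fraction
from math import floor

def scale(pattern, r):
    """Return P scaled to r: one letter for every pixel center covered by an r-interval."""
    r = Fraction(r)
    m = len(pattern)
    # Pixel center i - 1/2 is covered iff i - 1/2 <= m * r, so the length is floor(m*r + 1/2).
    length = floor(m * r + Fraction(1, 2))
    letters = []
    for i in range(1, length + 1):
        center = Fraction(2 * i - 1, 2)
        # center lies in ((k-1)r, kr] exactly when k = ceil(center / r).
        k = -((-center) // r)
        letters.append(pattern[k - 1])
    return "".join(letters)

P = "aabccc"
T = "aabdaaaabbcccccb"
scaled = scale(P, Fraction(3, 2))
location = T.find(scaled) + 1      # 1-based text location of the first occurrence
```

With these inputs `scaled` is the string aaabbcccc and `location` is 6, in agreement with the example.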

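Let $x \in \mathbb{R}$. $\|x\|$ denotes the rounding of x, i.e.

$$\|x\| = \begin{cases} \lfloor x \rfloor & \text{if the fraction part of } x \text{ is less than } .5 \\ \lceil x \rceil & \text{otherwise.} \end{cases}$$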

We need the following technical claim. 

Claim. Let k, k', and l be some positive integers.

1. $\{r \mid \|rk\| = l\} = \left[\frac{l-0.5}{k}, \frac{l+0.5}{k}\right)$.

2. $\{r \mid \|r(k+k')\| - \|rk'\| = l\} \subseteq \left(\frac{l-1}{k}, \frac{l+1}{k}\right)$.

Proof. The first part follows immediately from the definition of rounding. To
prove the second part, note that $x - 0.5 < \|x\| \le x + 0.5$, so

$$\|r(k+k')\| - \|rk'\| < r(k+k') + 0.5 - (rk' - 0.5) = rk + 1$$

and

$$\|r(k+k')\| - \|rk'\| > r(k+k') - 0.5 - (rk' + 0.5) = rk - 1.$$

Hence, the r's that satisfy $\|r(k+k')\| - \|rk'\| = l$ also satisfy
$rk - 1 < l < rk + 1$, which implies $\frac{l-1}{k} < r < \frac{l+1}{k}$.

3 A Local Verification Algorithm 

One possible straightforward approach to solving the scaled matching problem 
is to verify for each location of the text whether the scaled occurrence definition 
holds at that location. However, even this verification needs some clarification. To 
simplify, we define below the symbol form and the run-lengths of the symbols 
separately. This will give a handle on the verification process. An additional 
benefit is that this representation also compresses the text and pattern and 
hence will lead to faster algorithms. 

Definition 3. Let $S = \sigma_1 \sigma_2 \cdots \sigma_n$ be a string over some
alphabet $\Sigma$. The run-length representation of string S is the string
$S' = {\sigma'_1}^{r_1} {\sigma'_2}^{r_2} \cdots {\sigma'_k}^{r_k}$ such that:
(1) $\sigma'_i \ne \sigma'_{i+1}$ for $1 \le i < k$; and (2) S can be described
as concatenation of the symbol $\sigma'_1$ repeated $r_1$ times, the symbol
$\sigma'_2$ repeated $r_2$ times, ..., and the symbol $\sigma'_k$ repeated
$r_k$ times. We denote by $S^{\Sigma} = \sigma'_1 \sigma'_2 \cdots \sigma'_k$
the symbol part of S', and by $c(S) = r_1 r_2 \cdots r_k$ the run-length part
of S' (c(S) is a string over the alphabet of natural numbers).

The locator function between S and S' is $loc_S(i) = j$, where j is the index
for which $\sum_{l=1}^{j-1} r_l < i \le \sum_{l=1}^{j} r_l$.

The center of S, denoted $C_S$, is the substring of S that contains all the
letters of S except the first $r_1$ letters and the last $r_k$ letters.
Example 2. Let S = aaaaaabbcccaabbbddddd. Then $S' = a^6 b^2 c^3 a^2 b^3 d^5$,
$S^{\Sigma}$ = abcabd and c(S) = 623235. The locator function is
$loc_S(1) = 1, loc_S(2) = 1, \ldots, loc_S(6) = 1, loc_S(7) = 2, \ldots,
loc_S(21) = 6$. The center of S is bbcccaabbb.
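The quantities in Example 2 can be computed directly. The sketch below is illustrative only; the function names `run_length`, `locator` and `center` are hypothetical and positions are 1-based, as in the text.

```python
from itertools import groupby

def run_length(s):
    """Return (symbol part, run-length part) of s, e.g. ("abc", [2, 1, 3]) for aabccc."""
    symbols, counts = [], []
    for ch, group in groupby(s):
        symbols.append(ch)
        counts.append(len(list(group)))
    return "".join(symbols), counts

def locator(counts, i):
    """Return the 1-based run index j such that position i of the string falls in run j."""
    total = 0
    for j, c in enumerate(counts, start=1):
        total += c
        if i <= total:
            return j
    raise IndexError("position beyond the end of the string")

def center(s):
    """Return s without its first and last runs."""
    _, counts = run_length(s)
    return s[counts[0]:len(s) - counts[-1]]

S = "aaaaaabbcccaabbbddddd"
symbol_part, runs = run_length(S)
```

For this S the symbol part is abcabd, the run-length part is [6, 2, 3, 2, 3, 5], and `center(S)` is bbcccaabbb.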

3.1 Reformulating the Definition 

Scaled matching requires finding all scaled occurrences of P in T. To achieve
this goal we will use P' and T'. There are two requirements for a scaled
occurrence of P at a given location i of T. The first is that $P^{\Sigma}$
matches a substring of $T^{\Sigma}$ beginning at location $loc_T(i)$ of
$T^{\Sigma}$. This can be verified in linear time with any classical pattern
matching algorithm, e.g. [13]. The second requirement is that there is a scale
r for which P scales properly to match at the appropriate location in T. For
this we will use c(P) and c(T). Since the first requirement is easy to verify,
from here on we will focus on the second requirement.

Denote m' = |c(P)| and n' = |c(T)|. We also denote $c(P) = p_1, \ldots, p_{m'}$
and $c(T) = t_1, \ldots, t_{n'}$. We assume that $m' \ge 3$ and $n' \ge 3$,
otherwise the problem is easily solvable in linear time. When a scaled match
occurs at location i of T the location in the compressed text that corresponds
to it is $j = loc_T(i)$. However, only part of the full length of $t_j$ may
need to match the scaled pattern. More precisely, a length
$\tilde{t} = \sum_{l=1}^{j} t_l + 1 - i$ piece of $t_j$ needs to match the
scaled pattern. The full set of desired scaling requirements follows.



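Scaling requirements at text location i, where $loc_T(i) = j$ (here
$\tilde{t}$ is the length of the piece of $t_j$ that needs to match):

$$\begin{aligned}
\|p_1 \cdot r\| &= \tilde{t} \\
\|(p_1 + p_2) \cdot r\| &= \tilde{t} + t_{j+1} \\
&\;\;\vdots \\
\|(p_1 + \cdots + p_k) \cdot r\| &= \tilde{t} + t_{j+1} + \cdots + t_{j+k-1} \\
&\;\;\vdots \\
\|(p_1 + \cdots + p_{m'-1}) \cdot r\| &= \tilde{t} + t_{j+1} + \cdots + t_{j+m'-2} \\
\|m \cdot r\| &\le \tilde{t} + t_{j+1} + \cdots + t_{j+m'-1}
\end{aligned}$$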



The following claim follows from the discussion above and its correctness 
follows directly from the definition. 

Claim. Let P be a pattern and T a text. There is a scaled occurrence of P
at location i of T iff $P^{\Sigma}$ matches at location $loc_T(i)$ of
$T^{\Sigma}$ and the scaling requirements for location i are satisfied.

3.2 The Algorithm 

The importance of verifying the scaling requirements efficiently follows from 
Claim 3.1. The claim below shows that this can be done efficiently. 
Claim. Let i be a location of T. The scaling requirements for i can be verified
in $O(m')$ time.

Proof. The scaling requirements are verified by finding the set of all scales r
that satisfy the requirements. By Claim 2, for each of the first m' - 1 scaling
requirements, the set of all r's that satisfy the requirement is an interval of
length at most 1. Moreover, the last scaling requirement demands that
$r \in [1, (\tilde{t} + t_{j+1} + \cdots + t_{j+m'-1} + 0.5)/m)$. The
intersection of these intervals is the set of all r's for which $P^r$ appears
in location i. This intersection can be found in $O(m')$ time. □

The following straightforward algorithm can now be devised: The algorithm 
checks the scaling requirement for every location i of T . 

Running Time: There are n - m + 1 locations in T. For each location the
existence of a scaled match can be checked in $O(m')$ time by Claim 3.2. So the
overall time of the algorithm is $O(nm')$.
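The local verification algorithm can be sketched as follows. This is an illustrative reading of the scaling requirements of Section 3.1, not the authors' implementation: the names `rle` and `scaled_matches` are hypothetical, the symbol parts are compared naively instead of with a linear-time matcher, and each scale range is kept as a pair of exact fractions.

```python
from fractions import Fraction
from itertools import groupby

def rle(s):
    """Return (list of run symbols, list of run lengths) of s."""
    runs = [(ch, len(list(g))) for ch, g in groupby(s)]
    return [ch for ch, _ in runs], [n for _, n in runs]

def scaled_matches(pattern, text):
    """Map each 1-based text location i with a scaled occurrence to its scale range [lo, hi)."""
    p_sym, p = rle(pattern)
    t_sym, t = rle(text)
    m, mp = len(pattern), len(p)
    half = Fraction(1, 2)
    result = {}
    run_end = 0                                   # t_1 + ... + t_j for the current run j
    for j0 in range(len(t) - mp + 1):             # j0 = j - 1 (0-based run index)
        run_start = run_end + 1
        run_end += t[j0]
        if t_sym[j0:j0 + mp] != p_sym:            # first requirement: symbol parts agree
            continue
        for i in range(run_start, run_end + 1):
            t_tilde = run_end + 1 - i             # piece of run j used by the match
            lo, hi = Fraction(1), None
            p_sum, t_sum = 0, t_tilde - t[j0]
            for k in range(mp - 1):               # the first m' - 1 scaling requirements
                p_sum += p[k]
                t_sum += t[j0 + k]
                lo = max(lo, (t_sum - half) / p_sum)
                bound = (t_sum + half) / p_sum
                hi = bound if hi is None else min(hi, bound)
            t_sum += t[j0 + mp - 1]               # last requirement: ||m r|| <= t_sum
            bound = (t_sum + half) / m
            hi = bound if hi is None else min(hi, bound)
            if lo < hi:                           # non-empty intersection of the intervals
                result[i] = (lo, hi)
    return result

res = scaled_matches("aabccc", "aabdaaaabbcccccb")
```

On the strings of Example 1 the sketch reports locations 5, 6 and 7; the range found for location 6 is [3/2, 7/4), which contains the scale 3/2 used in the example. Each location costs O(m') interval updates, matching the O(nm') bound above.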

4 A Dictionary Based Solution 

A different approach for solving the scaled matching problem is to create a dictio- 
nary containing the run length part of P scaled at all possible scales. Substrings 
of the compressed text can then be checked for existence in the dictionary. The 
problem with this solution is that there may be many scales and hence many dif- 
ferent strings in the dictionary. In fact, the dictionary to be created could be as 
large as, or larger than, the running time of the naive algorithm. To circumvent 
this problem we will keep in the dictionary only scaled instances of the pattern
with a scale at most $\alpha$, where the value of $\alpha$ will be determined
later. Checking for occurrences of the pattern with a scale larger than $\alpha$
will be performed using the algorithm from the previous section.

To bound the number of strings in the dictionary, we use the following lemma,
which is a special case of Lemma 1 in [4].

Lemma 1. Let pattern P be scaled to size $l \ge m$. Then there are $k \le m'$
intervals, $[a_1, a_2), [a_2, a_3), \ldots, [a_k, a_{k+1})$, where
$a_1 < a_2 < \cdots < a_{k+1}$, for which the following hold:

1. $P^{r_1} = P^{r_2}$ if $r_1$ and $r_2$ are in the same interval.

2. $P^{r_1} \ne P^{r_2}$ if $r_1$ and $r_2$ are in different intervals.

3. $P^r$ has length l if and only if $r \in [a_1, a_{k+1})$.

Proof. By Claim 2, if the length of $P^r$ is l, then r belongs to the interval
$I = \left[\frac{l-0.5}{m}, \frac{l+0.5}{m}\right)$. Consider the r-ary pixel
array for P when the value of r goes from $r = \frac{l-0.5}{m}$ to
$r = \frac{l+0.5}{m}$. Consider some value of r in which a new scaled pattern
is reached, namely $P^r \ne P^{r-\epsilon}$ for every $\epsilon > 0$. By
definition, this happens when the right endpoint of some r-interval coincides
with some pixel center. The right endpoint of the rightmost r-interval moves a
distance of exactly 1 when r goes over I, and each other endpoint moves a
distance smaller than 1. In particular, each endpoint coincides with a pixel
center at most one time. □
4.1 Building the Dictionary

Let $\mathcal{D}$ be a set containing $c(C_{P^r})$ for every scaled pattern
$P^r$ with $r \le \alpha$. By Lemma 1, the number of strings in $\mathcal{D}$
is $O(\alpha m m')$. For each $P' \in \mathcal{D}$, let $R_{P'}$ be the set of
all values of r such that $c(C_{P^r}) = P'$. Each set $R_{P'}$ is a union of
intervals, and let $|R_{P'}|$ denote the number of intervals in $R_{P'}$.

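Example 3. Let P = abcd and $\alpha = 2$. Then,

$$c(P^r) = \begin{cases}
1111 & \text{for } r \in [1, 1.125) \\
1112 & \text{for } r \in [1.125, 7/6) \\
1121 & \text{for } r \in [7/6, 1.25) \\
1211 & \text{for } r \in [1.25, 1.375) \\
1212 & \text{for } r \in [1.375, 1.5) \\
2121 & \text{for } r \in [1.5, 1.625) \\
2122 & \text{for } r \in [1.625, 1.75) \\
2212 & \text{for } r \in [1.75, 11/6) \\
2221 & \text{for } r \in [11/6, 1.875) \\
2222 & \text{for } r \in [1.875, 2]
\end{cases}$$

Thus, $\mathcal{D} = \{11, 12, 21, 22\}$, $R_{11} = [1, 7/6)$,
$R_{12} = [7/6, 1.25) \cup [1.5, 1.75)$,
$R_{21} = [1.25, 1.5) \cup [1.75, 11/6)$, and $R_{22} = [11/6, 2]$.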
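The table of Example 3 can be regenerated mechanically. The sketch below is an illustration under stated assumptions, not the construction of this paper: the names `scaled_run_lengths`, `breakpoints` and `scale_table` are hypothetical, the rounding is computed as floor(x + 1/2), and the breakpoints are simply collected in a set and sorted rather than bin sorted.

```python
from fractions import Fraction
from math import floor

def scaled_run_lengths(p, r):
    """Run-length part of P scaled to r, given the run-length part p of P."""
    out, prefix, prev = [], 0, 0
    for run in p:
        prefix += run
        cur = floor(prefix * r + Fraction(1, 2))   # rounding of (p_1 + ... + p_k) * r
        out.append(cur - prev)
        prev = cur
    return out

def breakpoints(p, alpha):
    """Sorted scales in (1, alpha] where an r-interval endpoint meets a pixel center."""
    points, prefix = set(), 0
    for run in p:
        prefix += run
        i = prefix + 1                             # solve prefix * r = i - 1/2 for r > 1
        while Fraction(2 * i - 1, 2 * prefix) <= alpha:
            points.add(Fraction(2 * i - 1, 2 * prefix))
            i += 1
    return sorted(points)

def scale_table(p, alpha):
    """List of (left endpoint, run-length string) for every distinct scaled pattern."""
    starts = [Fraction(1)] + breakpoints(p, alpha)
    return [(r, "".join(map(str, scaled_run_lengths(p, r)))) for r in starts]

table = scale_table([1, 1, 1, 1], 2)               # P = abcd, alpha = 2
centers = {runs[1:-1] for _, runs in table}        # run-length parts of the centers
```

For P = abcd the ten rows of `table` start at 1, 9/8, 7/6, 5/4, 11/8, 3/2, 13/8, 7/4, 11/6 and 15/8, and `centers` is the set {11, 12, 21, 22}.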

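Lemma 2. For every $P' \in \mathcal{D}$, $|R_{P'}| = O(m)$.

Proof. Let l be the sum of the characters of P', and let a be the number of
characters in $C_P$, namely $a = p_2 + \cdots + p_{m'-1}$. If $r \in R_{P'}$,
then $\|r(p_1 + a)\| - \|rp_1\| = l$. By Claim 2 we obtain that
$r \in \left(\frac{l-1}{a}, \frac{l+1}{a}\right)$. For
$r \in \left(\frac{l-1}{a}, \frac{l+1}{a}\right)$, the length of $P^r$ is in
the interval
$\left[\left\|\frac{l-1}{a} m\right\|, \left\|\frac{l+1}{a} m\right\|\right]$.
From Lemma 1, it follows that $|R_{P'}| = O(m' \cdot m/a)$. We have assumed
that $m' \ge 3$, so $a \ge m' - 2 \ge \frac{1}{3} m'$. Hence,
$|R_{P'}| = O(m)$. □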

We now describe how to build the dictionary. Instead of storing in the
dictionary the actual strings of $\mathcal{D}$, we will assign a unique name
for every string in $\mathcal{D}$ using the fingerprinting algorithm of Amir
et al. [1], and we will store only these names.

The first step is generating all the scaled patterns $P^r$ for $r \le \alpha$
in an increasing order of r. This is done by finding all the different values
of r in which the right endpoint of some r-interval coincides with a pixel
center. These values can be easily found for each r-interval, and then sorted
using bin sorting. Denote the sorted list by L.

Afterward, build an array $A[1..m'-2]$ that contains the run length part of
$C_P$. For simplicity, assume that m' - 2 is a power of 2 (otherwise, we can
append zeros to the end of A until the size of A is a power of 2). Now, compute
a name for A by giving a name for every sub-array of A of the form
$A[j2^i + 1..(j+1)2^i]$. The name given to a sub-array A[j..j] is equal to its
content. The name given to a sub-array $A' = A[j2^i + 1..(j+1)2^i]$ depends on
the names given to the two sub-arrays $A[2j \cdot 2^{i-1} + 1..(2j+1)2^{i-1}]$
and $A[(2j+1)2^{i-1} + 1..(2j+2)2^{i-1}]$. If the names of these sub-arrays are
a and b, respectively, then check whether the pair of names (a, b) was
encountered before. If it was, then the name of A' is the name that was
assigned to the pair (a, b). Otherwise, assign a new name to the pair (a, b)
and also assign this name to A'.

After naming A, traverse L, and for each value r in L, update A so it will
contain the run length part of $C_{P^r}$. After each time a value in A is
changed, update the names of the $\log(m'-2) + 1$ sub-arrays of A that contain
the position of A in which the change occurred.

During the computation of the names we also compute the sets $R_{P'}$ for all
$P' \in \mathcal{D}$.

Running Time: We store the pairs of names in an $L \times L$ table, where
$L = O(\alpha m m')$ is an upper bound on the number of distinct names. Thus,
the initial naming of A takes $O(m')$ time, and each update takes
$O(\log m')$ time. Therefore, the time for computing all the names is
$O(\alpha m m' \log m')$. Using the approach of [11], the space can be reduced
to $O(L)$.
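The naming scheme can be illustrated with a small data structure. This is a simplified sketch and not the fingerprinting algorithm of [1] itself: the class names `Namer` and `NamedArray` are hypothetical, a Python dictionary stands in for the $L \times L$ table, and single elements are named through the same table instead of by their content.

```python
class Namer:
    """Hands out one integer name per distinct key; shared by every array that is named."""
    def __init__(self):
        self.table = {}

    def leaf(self, value):
        return self.table.setdefault(("leaf", value), len(self.table))

    def pair(self, a, b):
        return self.table.setdefault((a, b), len(self.table))

class NamedArray:
    """Array of power-of-two length whose aligned sub-arrays carry consistent names."""
    def __init__(self, values, namer):
        n = len(values)
        assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
        self.n, self.namer = n, namer
        self.tree = [None] * (2 * n)               # heap layout, leaves at n .. 2n-1
        for idx, v in enumerate(values):
            self.tree[n + idx] = namer.leaf(v)
        for node in range(n - 1, 0, -1):
            self.tree[node] = namer.pair(self.tree[2 * node], self.tree[2 * node + 1])

    def update(self, idx, value):
        """Change one entry (0-based) and rename the log(n) + 1 sub-arrays containing it."""
        node = self.n + idx
        self.tree[node] = self.namer.leaf(value)
        node //= 2
        while node >= 1:
            self.tree[node] = self.namer.pair(self.tree[2 * node], self.tree[2 * node + 1])
            node //= 2

    @property
    def name(self):
        return self.tree[1]

namer = Namer()
a = NamedArray([1, 2, 1, 1], namer)
b = NamedArray([1, 1, 1, 1], namer)
different_before = a.name != b.name
b.update(1, 2)                                     # b now holds 1, 2, 1, 1
equal_after = a.name == b.name
```

Two arrays receive the same name exactly when their contents agree, and an update touches only the sub-arrays on one root-to-leaf path, which is the source of the $O(\log m')$ update cost.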

4.2 Scanning the Text 

The first step is naming every substring of c(T) of length $2^i$ for
$i = 0, \ldots, \log(m'-2)$. The name of a substring of length 1 is equal to
its content. The name of a substring T' of length $2^i$ ($i > 0$) is computed
from the names of the two substrings of length $2^{i-1}$ whose concatenation
forms T'. The naming is done using the same $L \times L$ array that was used
for the naming of A, so the names in this stage are consistent with the names
in the dictionary building stage. In other words, a substring of c(T) that is
equal to a string P' from $\mathcal{D}$ will get the same name as P'.

Now, for every location i of T, compute the range of scales r that satisfy the
first and last two scaling requirements. If this range is empty, proceed to the
next i. Otherwise, suppose that this range is $[r_1, r_2)$. If $r_2 > \alpha$,
check all the other scaling requirements. If $r_2 \le \alpha$, check whether
the name of the substring of c(T) of length m' - 2 that begins at
$loc_T(i) + 1$ is equal to the name of a string $P' \in \mathcal{D}$. If there
is such a string P', compute $[r_1, r_2) \cap R_{P'}$, and report a match if
the intersection is not empty.

Running Time: Computing the names takes $O(n' \log m')$ time. By storing each
set $R_{P'}$ using a balanced binary search tree, we can compute the
intersection $[r_1, r_2) \cap R_{P'}$ in time $O(\log |R_{P'}|) = O(\log m)$
(the equality follows from Lemma 2). Therefore, the time complexity of this
stage is $O(n \log m + l m')$, where l is the number of locations in which all
the scaling requirements are checked. Let $S_j$ be the set of all such
locations i with $loc_T(i) = j$, and let S be the set of all indices j for
which $S_j$ is not empty. Clearly, $l = \sum_{j \in S} |S_j|$. The following
lemmas give an upper bound on l.

Lemma 3. For every j, $|S_j| = O(1 + p_1/m')$.
Proof. Fix a value for j. Let $a = p_2 + \cdots + p_{m'-1}$ and
$l = t_{j+1} + \cdots + t_{j+m'-2}$. Suppose that $i \in S_j$. As the first and
second last scaling requirements are satisfied, we have that
$\|r(p_1 + a)\| - \|rp_1\| = (\tilde{t} + l) - \tilde{t} = l$. By Claim 2,
$r \in \left(\frac{l-1}{a}, \frac{l+1}{a}\right)$, so
$rp_1 \in \left(\frac{l-1}{a} p_1, \frac{l+1}{a} p_1\right)$. Since
$i = t_1 + \cdots + t_j + 1 - \tilde{t} = t_1 + \cdots + t_j + 1 - \|rp_1\|$
and this is true for every $i \in S_j$, it follows that

$$|S_j| \le \left|\left\{\|x\| : x \in \left(\frac{l-1}{a} p_1, \frac{l+1}{a} p_1\right)\right\}\right| \le 2 + \frac{2p_1}{a} \le 2 + \frac{6p_1}{m'},$$

where the last inequality follows from the fact that $m' \ge 3$. □

Lemma 4. $|S| = O\left(\frac{n}{\alpha p_1}\right)$.

Proof. Suppose that $j \in S$, and let i be some element in $S_j$. Let
$[r_1, r_2)$ be the scales interval computed for location i using the first and
last two scaling requirements. By the definition of the algorithm,
$r_2 > \alpha$. The interval of r's that satisfy the first scaling requirement
is $\left[\frac{\tilde{t}-0.5}{p_1}, \frac{\tilde{t}+0.5}{p_1}\right)$, so we
obtain that $\frac{\tilde{t}+0.5}{p_1} \ge r_2 > \alpha$. Thus,
$t_j \ge \tilde{t} > \alpha p_1 - 0.5$. Since this is true for every $j \in S$,
it follows that $|S| < \frac{n}{\alpha p_1 - 0.5}$. □

By Lemmas 3 and 4,
$l = O\left(\frac{n}{\alpha p_1}\left(1 + \frac{p_1}{m'}\right)\right) = O\left(\frac{n}{\alpha}\right)$.
Therefore, the total time complexity of the algorithm is
$O(\alpha m m' \log m' + n \log m + n m'/\alpha)$. This expression is minimized
by choosing $\alpha = \sqrt{n/(m \log m')}$. We obtain the following theorem:

Theorem 1. The scaled matching problem can be solved in
$O(n \log m + \sqrt{nm}\, m' \sqrt{\log m'})$ time.
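The choice of the threshold can be checked by balancing the two terms of the running time that depend on it. The short derivation below is added for the reader's convenience and uses only the quantities defined above, writing $\alpha$ for the scale threshold of Section 4.

```latex
\begin{align*}
\alpha\, m\, m' \log m' = \frac{n m'}{\alpha}
  \;&\Longrightarrow\; \alpha^2 = \frac{n}{m \log m'}
  \;\Longrightarrow\; \alpha = \sqrt{\frac{n}{m \log m'}},\\[4pt]
\alpha\, m\, m' \log m'
  &= \sqrt{\frac{n}{m \log m'}}\; m\, m' \log m'
   = \sqrt{nm}\; m' \sqrt{\log m'},\\[4pt]
\frac{n m'}{\alpha}
  &= n m' \sqrt{\frac{m \log m'}{n}}
   = \sqrt{nm}\; m' \sqrt{\log m'} .
\end{align*}
```

Both terms therefore contribute $\sqrt{nm}\, m' \sqrt{\log m'}$, and since $m' \le m$ this is at most $\sqrt{n}\, m^{3/2} \sqrt{\log m}$.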
Abstract. Given a set of strings $U = \{T_1, T_2, \ldots, T_\ell\}$, the longest
common repeat problem is to find the longest common substring that appears
at least twice in each string of U. We also consider reversed and
reverse-complemented repeats as well as normal repeats. We present a linear
time algorithm for the longest common repeat problem.



1 Introduction 

Repetitive or periodic strings have a great importance in a variety of applications 
including computational molecular biology, data mining, data compression, and 
computer-assisted music analysis. For example, it is assumed that repetitive 
substrings in a biological sequence have important meanings and functions [1]. 
Finding common substrings in a set of strings is also important. For example, 
motifs or short strings common to protein sequences are assumed to represent a 
specific property of the sequences [3]. 

In this paper we want to find common repetitive substrings in a set of strings. 
We especially focus on finding the longest common repeat in a set since the 
number of the common repeats in a set can be quite large. We also consider 
reversed and reverse-complemented strings in finding repeats. Formally we define 
our problem as follows. 

Let T be a string over an alphabet $\Sigma$. We assume $\Sigma = \{A, C, G, T\}$
or $\Sigma = \{A, C, G, U\}$ since a major application of the problem is
computational molecular biology. T[i] denotes the i-th character of T. T[i..j]
is the substring $T[i]T[i+1] \cdots T[j]$ of T. $T^R$ denotes the reverse
string of T where $|T^R| = |T|$ and $T^R[i] = T[|T|-i+1]$ for
$1 \le i \le |T|$. $T^{RC}$ denotes the reverse-complemented string of T where
$|T^{RC}| = |T|$ and $T^{RC}[i]$ and $T[|T|-i+1]$ form a Watson-Crick pair
($A \equiv$ (T or U) and $C \equiv G$) for $1 \le i \le |T|$.
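The two derived strings are easy to compute. The following sketch is only an illustration of the definitions; the function names `reverse` and `reverse_complement` and the `rna` flag are hypothetical.

```python
# Watson-Crick pairing tables for the DNA and RNA alphabets.
DNA_PAIRS = str.maketrans("ACGT", "TGCA")
RNA_PAIRS = str.maketrans("ACGU", "UGCA")

def reverse(t):
    """Return the reverse string of t."""
    return t[::-1]

def reverse_complement(t, rna=False):
    """Return the reverse-complemented string of t."""
    pairs = RNA_PAIRS if rna else DNA_PAIRS
    return t.translate(pairs)[::-1]

rev = reverse("AACTG")                 # GTCAA
rc = reverse_complement("AACTG")       # CAGTT
```

Position i of the reverse-complemented string pairs with position |T| - i + 1 of T, as required by the definition.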

A repeat of T is a substring of T which appears at least twice in T. There 
are three kinds of repeats. 

* Work supported by IMT 2000 Project AB02. 
- Normal repeat: A string p is called a normal repeat of T if
  $p = T[i..i+|p|-1]$ and $p = T[i'..i'+|p|-1]$ for $i \ne i'$.

- Reversed repeat: A string p is called a reversed repeat of T if
  $p = T[i..i+|p|-1]$ and $p^R = T[i'..i'+|p|-1]$.

- Reverse-complemented repeat: A string p is called a reverse-complemented
  repeat if $p = T[i..i+|p|-1]$ and $p^{RC} = T[i'..i'+|p|-1]$.

There are two reasons why we consider reversed and reverse-complemented
repeats: (i) We don’t know the directions of the strings in advance. (ii) In
some situations, reversed and reverse-complemented repeats play an important
role. For example, RNA secondary structures are determined by
reverse-complemented repeats.

The longest common repeat problem can be defined as follows. 

Problem 1. Given a set of strings $U = \{T_1, T_2, \ldots, T_\ell\}$, the
$(k, \ell)$ longest common repeat problem is to find the longest repeat
(normal, reversed or reverse-complemented) which is common to k strings in U
for $1 \le k \le \ell$.
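For small inputs the problem statement can be checked by brute force. The sketch below is a quadratic-space reference and not the linear time algorithm of this paper; it handles only normal repeats and only the case $k = \ell$, and the names `repeats` and `longest_common_repeat` are hypothetical.

```python
def repeats(t):
    """Return every substring of t that starts at two or more different positions."""
    seen, found = set(), set()
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            s = t[i:j]
            if s in seen:
                found.add(s)       # s was already produced from an earlier start position
            else:
                seen.add(s)
    return found

def longest_common_repeat(strings):
    """Longest normal repeat common to all strings (ties broken lexicographically)."""
    common = set.intersection(*(repeats(t) for t in strings))
    return max(common, key=lambda s: (len(s), s), default="")

answer = longest_common_repeat(["ACGACG", "TACGTACG"])
```

Here `answer` is ACG: it occurs twice in each input, while the longer repeat TACG of the second string does not occur in the first. Occurrences are allowed to overlap, since the definition only requires different start positions.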

For finding the longest normal repeat in a text T, Karp, Miller, and Rosenberg
first proposed an $O(|T| \log |T|)$ time algorithm [8]. However, it is an easy
application of the suffix tree [10, 13, 4] to find it in $O(|T|)$ time.

For approximate normal repeats, Landau and Schmidt gave an
$O(k|T| \log k \log |T|)$ time algorithm for finding approximate squares where
the allowed edit distance is at most k [9]. Schmidt also gave an
$O(|T|^2 \log |T|)$ time algorithm for finding approximate tandem or
non-tandem repeats [12].

The longest common repeat problem resembles the longest common substring 
problem. The difference is that the common substring should appear at least 
twice in each sequence in the longest common repeat problem. For the longest 
common substring problem with a set of strings $\{T_1, T_2, \ldots, T_\ell\}$, Hui showed an
$O(\sum_{i=1}^{\ell} |T_i|)$ time algorithm [7]. As far as we know, our algorithm is the first
one that solves the longest common repeat problem. 

2 Preliminaries 

A generalized suffix tree stores all the suffixes of a set of strings just like
a suffix tree stores all the suffixes of a string. It is easy to extend the
suffix tree construction algorithm [13] to building a generalized suffix tree
[5, page 116]. Figure 1 is an example of the generalized suffix tree for
$T_1$ = AACTG and $T_2$ = ACTGCTG. Each leaf node has an ID representing the
original string from which the suffix came. Identical suffixes of two or more
strings are considered as different ones. In this example, $T_1$ and $T_2$
share three identical suffixes CTG, TG, and G. Each of these suffixes has two
leaves with different IDs.
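The leaf bookkeeping of this example can be mimicked without building a tree. The sketch below is a naive quadratic stand-in for a generalized suffix tree, intended only to show which suffixes receive leaves with two different IDs; the function name `generalized_suffixes` is hypothetical.

```python
from collections import defaultdict

def generalized_suffixes(strings):
    """Map every suffix to its leaves, each leaf being (string ID, 1-based start position)."""
    leaves = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for start in range(len(s)):
            leaves[s[start:]].append((sid, start + 1))
    return leaves

leaves = generalized_suffixes(["AACTG", "ACTGCTG"])
shared = sorted(suffix for suffix, ids in leaves.items()
                if len({sid for sid, _ in ids}) == 2)
```

For the two strings of Figure 1, `shared` contains exactly CTG, G and TG, and each of them has one leaf with ID 1 and one with ID 2.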

From now on, let ST(T) denote the suffix tree of T and $GST(T_1..T_\ell)$ denote
the generalized suffix tree of $T_1, T_2, \ldots, T_\ell$. Let L(v) denote the
string obtained by concatenating the edge labels on the path from the root to a
node v in a suffix tree or a generalized suffix tree.

We define corresponding nodes between $ST(T_i)$ and $GST(T_1..T_\ell)$
($1 \le i \le \ell$).







Inbok Lee, Costas S. Iliopoulos, and Kunsoo Park 




Fig. 1. The generalized suffix tree for $T_1$ = AACTG and $T_2$ = ACTGCTG.



Definition 1. The corresponding node of an internal node v in $ST(T_i)$
($1 \le i \le \ell$) is a node v' in $GST(T_1..T_\ell)$ such that L(v) = L(v').

It is trivial to show that each internal node v in $ST(T_i)$ has a corresponding
node v' in $GST(T_1..T_\ell)$ since $GST(T_1..T_\ell)$ stores all the suffixes
of the strings.

We define repeats with some properties. A maximal repeat is a repeat that 
cannot be extended to the left or right. For example, in T = AAGTGTGAAG,
AG is a repeat, but not a maximal one. We can get a maximal repeat AAG
by adding the immediate left character A. It is obvious that a repeat is either
maximal or a substring of another maximal one. A supermaximal repeat is a 
maximal repeat that never occurs as a substring of any other maximal repeat. For 
example, in T = G AA GG AA G AA G. AA is a maximal repeat since it appears 
three times in T and GAA and AAG appear only once in T. But it is not a 
supermaximal repeat because another maximal repeat G AA G contains AA. In 
this example, GAAG is a supermaximal repeat of T. Figure 2 shows a general 
relation between a maximal repeat and a supermaximal repeat. 

Fig. 2. A is a supermaximal repeat and B is a maximal repeat, but not a supermaximal
one.

Lemma 1. A repeat is either supermaximal or a substring of another supermax- 
imal one. 

Proof. We have only to show that a repeat which is maximal but not supermax- 
imal is a substring of another supermaximal one. It follows from the definition 
of supermaximal repeats. 

For an internal node v in ST(T), L(v) is a supermaximal repeat of T if
and only if all of v's children are leaves and each leaf has a distinct character
immediately to the left of the suffix corresponding to it [5, pages 143-148]. Hence
the number of supermaximal repeats of T is $O(|T|)$, and they can be found in
$O(|T|)$ time.

3 Algorithm 

Our algorithm for the longest common repeat problem is based on the following 
property. 

Fact 1. Given a set of strings $U = \{T_1, T_2, \ldots, T_\ell\}$, the longest
common repeat of U is the longest string which is a substring of a
supermaximal repeat of each string in U.

The outline of our algorithm for the longest common repeat problem is as 
follows. 

- Step 1: Create a new string $T'_i$ for each $1 \le i \le \ell$ to consider
  reversed and reverse-complemented repeats.

- Step 2: Build $ST(T'_i)$ for each $1 \le i \le \ell$. Also, build
  $GST(T'_1..T'_\ell)$.

- Step 3: Find supermaximal repeats of $T'_i$ for each i in
  $GST(T'_1..T'_\ell)$.

- Step 4: Modify $GST(T'_1..T'_\ell)$ and build the generalized suffix tree of
  the supermaximal repeats.

- Step 5: Find the longest common repeat among the supermaximal repeats
  using the generalized suffix tree made in Step 4.

The hard part of the algorithm is Step 4, which changes $GST(T'_1..T'_\ell)$
into the generalized suffix tree of the supermaximal repeats in linear time.

Step 1: We first modify each string in U to consider the reversed and
reverse-complemented repeats. For each $i = 1, 2, \ldots, \ell$, we create a
new string $T'_i = T_i \% T_i^R \# T_i^{RC}$, where % and # are special
characters which are not in $\Sigma$. Normal repeats of $T'_i$ include normal,
reversed, and reverse-complemented repeats of $T_i$.
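Step 1 can be illustrated on a small DNA string. The sketch below assumes that the new string is the concatenation of the string, its reverse and its reverse complement, separated by the two special characters % and #; the function names `augment` and `occurrences` are hypothetical.

```python
DNA_PAIRS = str.maketrans("ACGT", "TGCA")

def augment(t):
    """Return t, its reverse and its reverse complement joined by the separators % and #."""
    rev = t[::-1]
    rev_comp = t.translate(DNA_PAIRS)[::-1]
    return t + "%" + rev + "#" + rev_comp

def occurrences(text, p):
    """Return the 0-based start positions of p in text (overlaps allowed)."""
    return [i for i in range(len(text) - len(p) + 1) if text.startswith(p, i)]

T = "ACCTTGGT"                       # ACC and its reverse complement GGT both occur in T
T_aug = augment(T)
in_original = occurrences(T, "ACC")
in_augmented = occurrences(T_aug, "ACC")
```

ACC occurs only once in T, so it is not a normal repeat of T. In the augmented string it occurs at positions 0 and 18, the second occurrence lying in the reverse-complemented part, so the reverse-complemented repeat has become a normal repeat. Because the separators do not belong to the alphabet, no occurrence can span two parts.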

Step 2: We build the suffix trees and the generalized suffix tree. For each
$i = 1, 2, \ldots, \ell$, we build $ST(T'_i)$ with a modification. When we
create an internal node v, we store an additional information (j, j') at v. It
means that v is the lowest common ancestor (LCA) of two leaves representing
$T'_i[j..|T'_i|]$ and $T'_i[j'..|T'_i|]$, respectively. If there are more than
two leaves in the subtree rooted at v, any two leaves whose LCA is v can be
chosen. See Figure 3. (We can store (3,1) instead of (3,2) at the second
internal node.) This modification does not change the time and space
complexities. We also build $GST(T'_1..T'_\ell)$. This procedure runs in
$O(\sum_{i=1}^{\ell} |T'_i|)$ time and space.

Step 3: We find supermaximal repeats of each string in U. For each
$i = 1, 2, \ldots, \ell$, we find supermaximal repeats of $T'_i$ using
$ST(T'_i)$ [5, pages 143-148].
Fig. 3. Modification of the suffix tree construction.



Now we have a set of nodes $M_i = \{v \mid v$ is an internal node of $ST(T'_i)$
and L(v) is a supermaximal repeat of $T'_i\}$. We compute the set of
corresponding nodes $V_i = \{v' \mid v'$ is an internal node of
$GST(T'_1..T'_\ell)$ and it is the corresponding node of $v \in M_i\}$. To do
so, we use the information obtained during the construction of $ST(T'_i)$.
Figure 4 illustrates the idea. We read the information (j, j') at v computed in
Step 2. Then we compute the LCA v' of two leaves in $GST(T'_1..T'_\ell)$
representing $T'_i[j..|T'_i|]$ and $T'_i[j'..|T'_i|]$, respectively. It is easy
to show that v' is the corresponding node of v. After
$O(\sum_{i=1}^{\ell} |T'_i|)$-time preprocessing, finding a corresponding node
takes constant time [6, 11, 2]. The total time complexity is
$O(\sum_{i=1}^{\ell} |T_i|)$.



Fig. 4. Finding the corresponding node of v. (Panels: $ST(T'_i)$ and $GST(T'_1..T'_\ell)$.)



Step 4: Now we explain the hard part of the algorithm, modifying $GST(T'_1..T'_\ell)$
into the generalized suffix tree of the supermaximal repeats in linear time. At this
point we have the sets $V_i$, where $L(v)$ for each $v \in V_i$ ($1 \le i \le \ell$) is a supermaximal
repeat of $T'_i$.

The outline of Step 4 is as follows. 

1. For each supermaximal repeat $S$, insert the suffixes of $S\$$ into $GST(T'_1..T'_\ell)$.

2. Identify the nodes of the current tree which should be included in the gen- 
eralized suffix tree of the supermaximal repeats. 

3. Remove the unnecessary nodes and edges of the tree. 
We first insert the suffixes of the supermaximal repeats into $GST(T'_1..T'_\ell)$.
For each $i = 1, 2, \ldots, \ell$, we traverse $GST(T'_1..T'_\ell)$ from each element of $V_i$ to
the root node following the suffix links. Whenever we meet a node previously
unvisited during the traversal, we create a new leaf node with ID $i$. These new
leaves are stored in a set $W_i = \{v \mid v$ is a leaf node of the tree and $L(v)$ is a suffix
of a supermaximal repeat of $T'_i$ followed by $\$\}$. We also link the current node and
the new leaf with an edge labeled by $\$$. The trick is that we stop the traversal
when we meet a previously visited node and move to the next element of $V_i$.
Figure 5 illustrates the idea. Suppose $T'_1$ has two supermaximal repeats, GGTG
and CTG. First we handle GGTG. The visiting order is 1 → 2 → 3 → 4 → 5.
After visiting node 5, which is the root node, we are done with GGTG and we
handle CTG. After visiting node 6, we visit node 3 again, following the suffix
link. Then we are done with CTG. This procedure runs in $O(|T'_i|)$ time and
space.

Fig. 5. Inserting the suffixes of GGTG and CTG into $GST(T'_1..T'_\ell)$.
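The early-stopping walk along suffix links can be mimicked on plain strings. The sketch below is a simplification, not the tree-based procedure itself: a node is identified by its path label, the suffix link of a label `s` is taken to be `s[1:]`, and the function name `insert_repeat_suffixes` and the two example repeats are assumptions chosen for illustration.

```python
def insert_repeat_suffixes(repeats):
    """Simulate the suffix-link walk of Step 4 for one string ID.

    Each node is represented by its path label; following a suffix link
    drops the first symbol. The walk for a repeat stops as soon as it
    reaches a node that an earlier walk has already visited.
    """
    visited = set()  # labels of nodes that already received a '$' leaf
    order = []       # nodes in the order in which they are first visited
    for s in repeats:
        node = s
        while node not in visited:
            visited.add(node)
            order.append(node)  # a new leaf with edge label '$' goes here
            if node == "":      # the root has no suffix link
                break
            node = node[1:]     # follow the suffix link
    return order


print(insert_repeat_suffixes(["GGTG", "CTG"]))
# ['GGTG', 'GTG', 'TG', 'G', '', 'CTG']
```

Because every node is appended at most once, the total work is bounded by the number of distinct nodes reached plus the number of repeats, which is the point of stopping at previously visited nodes.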

Now we identify the nodes of the tree which should be included in the
generalized suffix tree of the supermaximal repeats. To do so, for each
$i = 1, 2, \ldots, \ell$, we traverse the tree upward from each element of $W_i$ to the root
node and mark the nodes on the path. We stop the traversal if we meet a marked
node and go on to process the next element of $W_i$. The generalized suffix tree of
the supermaximal repeats consists of the marked nodes and the edges linking them
in the tree.

Finally, we remove the unnecessary nodes and edges which are not in the 
generalized suffix tree of the supermaximal repeats. We traverse the tree from 
original leaves (not the new leaves created in Step 4) to the root node upward 
and delete the nodes and edges on the path if they are not marked. We move 
to the next original leaf of the tree if we meet a marked node. After deleting 
all the unnecessary nodes and edges, we get the generalized suffix tree of the 
supermaximal repeats. This procedure runs in $O(\sum_{i=1}^{\ell} |T'_i|)$ time.

Step 5: The remaining problem is to find the longest common substring among
k supermaximal repeats with distinct IDs using the generalized suffix tree of the
supermaximal repeats. Unlike in the longest common substring problem, two or
more supermaximal repeats can have the same ID here.

Still we can use Hui's algorithm for the longest common substring problem
in this case, because it solves a rather general problem [7]. In that problem,
each leaf of the tree has a color (an ID in our problem) and we want to
find the deepest node (in terms of the length of $L(v)$) whose subtree has leaves with at
least $k$ colors. We do not give the details of Hui's algorithm here. Rather we
show an example in Figure 6. Suppose $T'_1$ has two supermaximal repeats GGTC
and CTG, $T'_2$ has two supermaximal repeats CTG and TGA, and $T'_3$ has two
supermaximal repeats TGG and ATG. For each internal node of the generalized
suffix tree of the supermaximal repeats, we compute the number of different IDs
in its subtree. The internal nodes with rectangles (nodes $\gamma$ and $\delta$) have leaves
with three different IDs in their subtrees. The internal nodes with circles (nodes
$\alpha$, $\beta$, and $\varepsilon$) have leaves with two different IDs in their subtrees. For the (3, 3)
longest common repeat problem, we compare the lengths of $L(\gamma)$ = TG and
$L(\delta)$ = G. The answer is TG. For the (2, 3) longest common repeat problem,
the answer is $L(\varepsilon)$ = CTG. The algorithm runs in $O(\sum_{i=1}^{\ell} |T'_i|)$ time, reporting
the answer of the $(k, \ell)$ longest common repeat problem for all $1 \le k \le \ell$.
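The example can be checked with a brute-force reference that ignores the suffix tree altogether. This is not Hui's algorithm and is far from linear time; the function name `longest_common_repeat` and the dictionary layout are assumptions for this sketch.

```python
def longest_common_repeat(repeats_by_id, k):
    """Brute force: the longest string that occurs as a substring of
    repeats carrying at least k distinct IDs (empty string if none)."""
    ids_of = {}  # substring -> set of IDs whose repeats contain it
    for ident, repeats in repeats_by_id.items():
        for s in repeats:
            for i in range(len(s)):
                for j in range(i + 1, len(s) + 1):
                    ids_of.setdefault(s[i:j], set()).add(ident)
    candidates = [w for w, ids in ids_of.items() if len(ids) >= k]
    # Longest first; ties are broken alphabetically to stay deterministic.
    return min(candidates, key=lambda w: (-len(w), w)) if candidates else ""


example = {1: ["GGTC", "CTG"], 2: ["CTG", "TGA"], 3: ["TGG", "ATG"]}
print(longest_common_repeat(example, 3))  # TG
print(longest_common_repeat(example, 2))  # CTG
```

Counting IDs rather than repeats reflects the remark that several supermaximal repeats may share one ID.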




Fig. 6. An example of the longest common repeat problem. Rectangles mark
internal nodes whose subtrees have leaves with three distinct IDs; circles mark
internal nodes whose subtrees have leaves with two distinct IDs.

Theorem 1. The $(k, \ell)$ longest common repeat problem can be solved in
$O(\sum_{i=1}^{\ell} |T_i|)$ time and space for all $1 \le k \le \ell$.

Proof. We showed that all the steps run in $O(\sum_{i=1}^{\ell} |T'_i|)$ time and space, and
$|T'_i| = O(|T_i|)$.

4 Conclusion 

We have defined the longest common repeat problem and presented a linear time 
algorithm for the problem, allowing reversed and reverse-complemented repeats. 
Remaining work includes devising a space-efficient algorithm for the longest common
repeat problem. Another possibility is the longest common approximate repeat
problem.
Abstract. We show how automaton-based sublinear$^1$ keyword pattern
matching (skpm) algorithms appearing in the literature can be seen as 
different instantiations of a general automaton-based skpm algorithm 
skeleton. Such algorithms use finite automata (FA) for efficient compu- 
tation of string membership in a certain language. The algorithms were 
formally derived as part of a new skpm algorithm taxonomy, based on 
an earlier suffix-based skpm algorithm taxonomy [1]. Such a taxonomy 
is based on deriving the algorithms from a common starting point by 
successively adding algorithm and problem details and has a number of 
advantages. It provides correctness arguments, clarifies the working of 
the algorithms and their interrelationships, helps in implementing the 
algorithms, and may lead to new algorithms being discovered by find- 
ing gaps in the taxonomy. We show how to arrive at the general al- 
gorithm skeleton and derive some instantiations, leading to well-known 
factor- and factor oracle-based algorithms. In doing so, we show the shift 
functions used for them can be (strengthenings of) shift functions used 
for suffix-based algorithms. This also results in a number of previously 
undescribed factor-based skpm algorithm variants, whose performance 
remains to be investigated. 



1 Introduction 

The (exact) keyword pattern matching (kpm) problem can be described as “the 
problem of finding all occurrences of keywords from a given set as substrings 
in a given string” [1]. Watson and Zwaan (in [1], [2, Chapter 4]) derived well- 
known solutions to the problem from a common starting point, factoring out their 
commonalities and presenting them in a common setting to better comprehend 
and compare them. Other overviews of kpm are given in [3,4] and many others. 

Although the original taxonomy contained many skpm algorithms, a new
category of skpm algorithms, based on factors instead of suffixes of keywords,
has emerged in the last decade. This category includes algorithms such as
(Set) Backward DAWG Matching [5] and (Set) Backward Oracle Matching [6,
7], which were added to the existing taxonomy by Cleophas [8].

$^1$ By sublinear, we mean that the number of symbol comparisons may be sublinear in
input string length.

In this paper, we show how suffix, 
factor and factor oracle automaton-based 
skpm algorithms can be seen as instan- 
tiations of a general algorithm skeleton. 

We show how this skeleton is derived by 
successively adding algorithm details to 
a naive, high-level algorithm. Since the 
suffix-based algorithms have been exten- 
sively described in the past [1,2,8], we 
focus our attention on the factor- and fac- 
tor oracle-based algorithms. 

Figure 1 shows the new skpm taxon- 
omy. Nodes represent algorithms, while 
edges are labeled with the detail they rep- 
resent. Most of the details are introduced 
in the course of the text; for those that 
are not, please refer to [9]. 

1.1 Related Work 

The complete original taxonomy is presented in [2, Chapter 4] and [1]. The
additions and changes are described in Cleophas's MSc thesis [8, Chapter 3]. The
new skpm taxonomy part is completely described in [9].

Fig. 1. A new automaton-based skpm algorithm taxonomy.

The SPARE Time (String PAttern REcognition) toolkit implements most 
algorithms in the taxonomy. It is discussed in detail in [8, Chapter 5] and will 
be available from http://www.fastar.org.

1.2 Taxonomy Construction 

In our case, a taxonomy is a classification according to essential details of
algorithms or data structures from a certain field, taking the form of a (directed
acyclic) taxonomy graph. The construction of a taxonomy has a number of goals:

— Providing algorithm correctness arguments, often absent in literature 

— Clarifying the algorithms’ working and interrelationships 

— Helping in correctly and easily implementing the algorithms [10,2] 

— Leading to new algorithms, by finding and filling gaps in the taxonomy 

The process of taxonomy construction is preceded by surveying the existing 
literature of algorithms in the problem field. Based on such a survey, one may 
try to bring order to the field by placing the algorithms found in a taxonomy. 
The various algorithms in a taxonomy are derived from a common starting
point by adding details indicating the variations between different algorithms. 
The common starting point is a naive algorithm whose correctness is easily 
shown. Associated with this algorithm are requirements in the form of a pre- 
and postcondition, invariant and a specification of (theoretical) running time 
and/or memory usage, specifying the problem under consideration. The details 
distinguishing the algorithms each belong to one of the following categories: 

— Problem details involve minor pre- and postcondition changes, restricting
in- or output.

— Algorithm details are used to specify variance in algorithmic structure.

— Representation details are used to indicate variance in data structures,
internally to an algorithm or also influencing in- and output representation.

— Performance details concern variance in running time and memory consumption.

As the representation and performance details mainly influence implementa- 
tion but do not influence the other goals stated above, problem and algorithm 
details are most important. These details form the taxonomy graph edges. 

Taxonomy construction is often done bottom-up, starting with single-node 
taxonomies for each algorithm in the problem domain literature. As one sees 
commonalities among them, one may find generalizations which allow combin- 
ing multiple taxonomies into one larger one with the new generalization as the 
root. Once a complete taxonomy has been constructed, it is presented top-down. 
Associated with the addition of a detail, correctness arguments showing how the 
more detailed algorithm is derived from its predecessor are given. To indicate a 
particular algorithm and form a taxonomy graph, we use the sequence of details 
in order of introduction. Sometimes an algorithm can be derived in multiple 
ways. This causes the taxonomy to take the form of a directed acyclic graph 
instead of a directed tree. 

This type of taxonomy development was also used for garbage collection [11],
FA construction and minimization [2, 12], graph representations [13] and others. 

1.3 Notation and Definitions Used 

Since a large part of this paper consists of derivations of existing algorithms,
we will often use notations corresponding to their use in existing literature on
those algorithms. We use $A$ and $B$ for arbitrary sets, $\mathcal{P}(A)$ for the powerset
of a set $A$, $V$ for the (non-empty and finite) alphabet and $V^*$ for words over
the alphabet, $P = \{p_0, p_1, \ldots, p_{|P|-1}\} \subseteq V^*$ for a finite, non-empty pattern set
with $lmin_P = (\mathrm{MIN}\, p : p \in P : |p|)$, as well as $R$ for predicates, $M$ for finite
automata and $Q$ for state sets. States are represented by $q$ and $q_0$. Symbols
$a, b, \ldots, e$ represent alphabet symbols from $V$, while $p, s, \ldots, z$ represent words
over alphabet $V$. Symbols $i, j, \ldots, n$ represent integer values. We use $\bot$ ('bottom')
to denote an undefined value. Sometimes functions, relations or predicates
are used that have names longer than just a single character.

A (deterministic) FA is a 5-tuple $M = (Q, V, \delta, q_0, F)$ where $Q$ is a finite set
of states, $\delta \in Q \times V \rightarrow Q$ is the transition relation, $q_0 \in Q$ is a start state and
$F \subseteq Q$ is a set of final states. We extend $\delta$ to $\delta^* \in Q \times V^* \rightarrow Q$ defined by
$\delta^*(q, \varepsilon) = q$ and $\delta^*(q, wa) = \delta(\delta^*(q, w), a)$.
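The extension of the transition function to words can be sketched in a few lines. The dictionary representation, the use of `None` for the undefined value, and the name `delta_star` are assumptions of this example, not notation from the paper.

```python
def delta_star(delta, q, w):
    """Extend a partial transition function to words.

    delta is a dict keyed by (state, symbol); None plays the role of
    the undefined value, and once it is reached it is never left.
    """
    for a in w:
        if q is None:
            break
        q = delta.get((q, a))
    return q


# Toy automaton whose defined paths spell the prefixes of "he".
delta = {(0, "h"): 1, (1, "e"): 2}
print(delta_star(delta, 0, "he"))  # 2
print(delta_star(delta, 0, "hx"))  # None
```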

We use $p^R$ for the reversal of a string $p$, and use string reversal on a set
of strings as well. A string $u$ is a factor (resp. prefix, suffix) of a string $v$ if
$v = sut$ (resp. $v = ut$, $v = su$) for some strings $s$ and $t$. We use functions fact, pref
and suff for the set of factors, prefixes and suffixes of a (set of) string(s) respectively.
We write $u \le_p v$ to denote that $u$ is a prefix of $v$. The infix operators $\upharpoonleft$, $\downharpoonleft$,
$\upharpoonright$, $\downharpoonright$ (pronounced 'left take', 'left drop', 'right take' and 'right drop' respectively)
for $0 \le k$ are defined as: $w \upharpoonleft k$ is the $k \min |w|$ leftmost symbols of $w$, $w \downharpoonleft k$ is the
$(|w| - k) \max 0$ rightmost symbols of $w$, $w \upharpoonright k$ is the $k \min |w|$ rightmost symbols of $w$
and $w \downharpoonright k$ is the $(|w| - k) \max 0$ leftmost symbols of $w$. For example,
$(hers) \upharpoonleft 3 = her$, $(hers) \downharpoonleft 1 = ers$, $(hers) \upharpoonright 5 = hers$ and $(hers) \downharpoonright 10 = \varepsilon$.
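For readers who prefer code, the four take and drop operators correspond closely to string slicing. The function names below are assumptions chosen to match the pronunciations given in the text; `k` is assumed to be non-negative.

```python
def left_take(w, k):
    """The min(k, len(w)) leftmost symbols of w."""
    return w[:k]


def left_drop(w, k):
    """The max(len(w) - k, 0) rightmost symbols of w."""
    return w[k:]


def right_take(w, k):
    """The min(k, len(w)) rightmost symbols of w."""
    return w[max(len(w) - k, 0):]


def right_drop(w, k):
    """The max(len(w) - k, 0) leftmost symbols of w."""
    return w[:max(len(w) - k, 0)]


print(left_take("hers", 3), left_drop("hers", 1),
      right_take("hers", 5), repr(right_drop("hers", 10)))
# her ers hers ''
```

The explicit `max(..., 0)` in the two right-hand operators avoids the wrap-around behaviour of negative slice indices.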

Our notation for quantifications is introduced in Appendix A. We use predicate
calculus in derivations [14] and present algorithms in an extended version
of (part of) the guarded command language [15]. In that language, $x, y := X, Y$
is used for multiple-variable assignment, while if $b \rightarrow S$ [] $\neg b \rightarrow T$ fi represents
executing $S$ if $b$ evaluates to true, and $T$ if $\neg b$ evaluates to true. The extensions
of the basic language are as $b \rightarrow S$ sa as a shortcut for if $b \rightarrow S$ [] $\neg b \rightarrow$ skip
fi, and for $x : R \rightarrow S$ rof for executing statement list $S$ once for each value of
$x$ initially satisfying $R$ (assuming there is a finite number of such values for $x$),
in arbitrarily chosen order [16].

2 An Automaton-Based Algorithm Skeleton 
for Sublinear Keyword Pattern Matching 

In this section, we work towards an automaton-based algorithm skeleton for 
skpm, by adding details to a naive solution. 

The kpm problem, given input string $S \in V^*$ and pattern set $P$, is to
establish (see Appendix A for our notation for quantifications)

    $R : O = (\bigcup l, v, r : lvr = S \wedge v \in P : \{(l, v, r)\})$

i.e. to let $O$ be the set of triples forming a splitting of $S$ in three such that the
middle part is a keyword in $P$. A trivial (but unrealistic) solution is

Algorithm 1()

    $O := (\bigcup l, v, r : lvr = S \wedge v \in P : \{(l, v, r)\})$ { $R$ }



The sequence of details describing this algorithm is the empty sequence. We 
may proceed by considering a substring of S as “suffix of a prefix of S” or as
“prefix of a suffix of S”. We choose the first possibility as this is the way that the
algorithms we consider treat substrings of input string S (the second leads to 
algorithms processing S from right to left instead). Applying “examine prefixes 
of a given string in any order” (algorithm detail (p)) to S, we obtain: 
Algorithm 2(p)

    $O := \emptyset$;
    for $(u, r) : ur = S \rightarrow$
        $O := O \cup (\bigcup l, v : lv = u \wedge v \in P : \{(l, v, r)\})$
    rof { $R$ }



This algorithm is used in [8, 2] to derive (non-sublinear) prefix-based algorithms
such as Aho-Corasick, Knuth-Morris-Pratt and Shift-And/-Or.

The update of O in the repetition of Algorithm 2 can be computed with 
another repetition, considering suffixes of u. Applying “examine suffixes of a 
given string in any order” (algorithm detail (s)) to string u we obtain: 

Algorithm 3(p, s)

    $O := \emptyset$;
    for $(u, r) : ur = S \rightarrow$
        for $(l, v) : lv = u \rightarrow$
            as $v \in P \rightarrow O := O \cup \{(l, v, r)\}$ sa
        rof
    rof { $R$ }
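Algorithm 3 translates almost literally into two nested loops. The sketch below fixes one particular order for the two repetitions (increasing length for prefixes, decreasing length for suffixes), which the algorithm itself leaves open; the function name `kpm_naive` and the example keyword set are assumptions.

```python
def kpm_naive(S, P):
    """Return all triples (l, v, r) with l + v + r == S and v in P."""
    O = set()
    for end in range(len(S) + 1):        # prefixes u = S[:end]
        u, r = S[:end], S[end:]
        for start in range(len(u) + 1):  # suffixes v = u[start:]
            l, v = u[:start], u[start:]
            if v in P:
                O.add((l, v, r))
    return O


print(sorted(kpm_naive("ushers", {"he", "she", "his", "hers"})))
# [('u', 'she', 'rs'), ('us', 'he', 'rs'), ('us', 'hers', '')]
```

Every pair of split points is examined, so the number of membership tests is quadratic in the length of S; the later refinements reduce exactly this work.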



Algorithm (p, s) consists of two nested non-deterministic repetitions. Each can
be determinized by considering prefixes (or suffixes as the case is) in increasing
(called detail (+)) or decreasing (detail (−)) order of length. Since the algorithms
we consider achieve sublinear behaviour by examining string S from left to right,
and patterns in P from right to left, we focus our attention on:

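Algorithm 4(p+, s+)

    $u, r := \varepsilon, S$;
    if $\varepsilon \in P \rightarrow O := \{(\varepsilon, \varepsilon, S)\}$ [] $\varepsilon \notin P \rightarrow O := \emptyset$ fi;
    { invariant: $ur = S \wedge O = (\bigcup x, y, z : xyz = S \wedge xy \le_p u \wedge y \in P : \{(x, y, z)\})$ }
    do $r \ne \varepsilon \rightarrow$
        $u, r := u(r \upharpoonleft 1), r \downharpoonleft 1$; $l, v := u, \varepsilon$;
        as $\varepsilon \in P \rightarrow O := O \cup \{(u, \varepsilon, r)\}$ sa;
        { invariant: $u = lv$ }
        do $l \ne \varepsilon \rightarrow$
            $l, v := l \downharpoonright 1, (l \upharpoonright 1)v$;
            as $v \in P \rightarrow O := O \cup \{(l, v, r)\}$ sa
        od
    od { $R$ }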



To arrive at a more efficient algorithm, we strengthen the inner loop guard
$l \ne \varepsilon$. In [1, 2], this was done by adding cand $(l \upharpoonright 1)v \in suff(P)$ (we use cand
(cor) for conditional con- resp. disjunction, i.e. the second operand is evaluated
if and only if necessary to determine the value of the con- or disjunction). A more
general strengthening is possible however. Suppose we have a function
$f \in \mathcal{P}(V^*) \rightarrow \mathcal{P}(V^*)$ satisfying $P \subseteq f(P) \wedge suff(f(P)) \subseteq f(P)$ (i.e. f is such that P is
included in f(P) and f(P) is suffix-closed); then we have (for all $w, x \in V^*$)
$w \notin f(P) \Rightarrow w \notin P$ and $w \notin f(P) \Rightarrow xw \notin P$ (application of the right conjunct
followed by the left one). We may therefore strengthen the guard to
$l \ne \varepsilon$ cand $(l \upharpoonright 1)v \in f(P)$ (algorithm detail (gs), for guard strengthening). This leads to:

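Algorithm 5(p+, s+, gs)

    $u, r := \varepsilon, S$;
    if $\varepsilon \in P \rightarrow O := \{(\varepsilon, \varepsilon, S)\}$ [] $\varepsilon \notin P \rightarrow O := \emptyset$ fi;
    { invariant: $ur = S \wedge O = (\bigcup x, y, z : xyz = S \wedge xy \le_p u \wedge y \in P : \{(x, y, z)\})$ }
    do $r \ne \varepsilon \rightarrow$
        $u, r := u(r \upharpoonleft 1), r \downharpoonleft 1$; $l, v := u, \varepsilon$;
        as $\varepsilon \in P \rightarrow O := O \cup \{(u, \varepsilon, r)\}$ sa;
        { invariant: $u = lv \wedge v \in f(P)$ }
        do $l \ne \varepsilon$ cand $(l \upharpoonright 1)v \in f(P) \rightarrow$
            $l, v := l \downharpoonright 1, (l \upharpoonright 1)v$;
            as $v \in P \rightarrow O := O \cup \{(l, v, r)\}$ sa
        od { $l = \varepsilon$ cor $(l \upharpoonright 1)v \notin f(P)$ }
    od { $R$ }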



Observe that $v \in f(P)$ is now an invariant of the inner repetition, initially
established by $v := \varepsilon$ (since $P \ne \emptyset$ and thus $\varepsilon \in f(P)$).
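A minimal Python sketch of Algorithm 5 is given below, with f(P) passed in as an explicit set rather than computed through an automaton. The names `factors` and `kpm_guarded`, the index-based representation of u, l and v, and the example keywords are assumptions made for illustration.

```python
def factors(P):
    """fact(P): all factors of all keywords, including the empty word."""
    return {p[i:j] for p in P
            for i in range(len(p) + 1) for j in range(i, len(p) + 1)}


def kpm_guarded(S, P, fP):
    """Algorithm 5, with fP a suffix-closed superset of P.

    u = S[:end], l = S[:start] and v = S[start:end] throughout.
    """
    O = {("", "", S)} if "" in P else set()
    for end in range(1, len(S) + 1):   # extend u by one symbol
        start = end                    # v starts out empty
        if "" in P:
            O.add((S[:end], "", S[end:]))
        # Extend v to the left while the extended word stays inside fP.
        while start > 0 and S[start - 1:end] in fP:
            start -= 1
            if S[start:end] in P:
                O.add((S[:start], S[start:end], S[end:]))
    return O


P = {"he", "she", "his", "hers"}
print(sorted(kpm_guarded("ushers", P, factors(P))))
# [('u', 'she', 'rs'), ('us', 'he', 'rs'), ('us', 'hers', '')]
```

Passing the set of suffixes of the keywords instead of `factors(P)` gives the same output with fewer left extensions, which mirrors the trade-off between the choices for f(P).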

Several choices for f(P) are possible, of which we mention: 

— suff(P), leading to the original taxonomy [1,2]. In [9], those algorithms are 
derived using an automaton-based algorithm skeleton. 

— fact(P), discussed in Section 3. 

— $factoracle(P^R)^R$, a superset of $fact(P^R)^R$ (= fact(P)), see Section 4.

— A function returning a superset of suff. This could be implemented using a 
suffix oracle [6, 7]. We will not explore this option here. 

Direct evaluation of $(l \upharpoonright 1)v \in f(P)$ is expensive and this is where automata come
in: to efficiently compute this guard conjunct, the transition function of a
finite automaton recognizing $f(P)^R$ is used, with the property:

Property 1 (Transition function of automaton recognizing $f(P)^R$). The transition
function $\delta_{R,f,P}$ of a deterministic FA $M = (Q, V, \delta_{R,f,P}, q_0, F)$ recognizing
$f(P)^R$ has the property that $\delta^*_{R,f,P}(q_0, w^R) \ne \bot \equiv w^R \in f(P)^R$. □

Property 1 requires $pref(f(P)^R) \subseteq f(P)^R$, i.e. $suff(f(P)) \subseteq f(P)$. Also note that
$w^R \in f(P)^R \equiv w \in f(P)$. Since we will always refer to the same set $P$, we will
use $\delta_{R,f}$ for $\delta_{R,f,P}$. Transition function $\delta_{R,f}$ can be computed beforehand.

By making $q = \delta^*_{R,f}(q_0, ((l \upharpoonright 1)v)^R)$ an invariant of the algorithm's inner
repetition, guard conjunct $(l \upharpoonright 1)v \in f(P)$ can be changed to $q \ne \bot$. We call this
algorithm detail (egc), for efficient guard computation. This algorithm detail
leads to the following algorithm skeleton:
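Algorithm 6(p+, s+, gs, egc)

    $u, r := \varepsilon, S$;
    if $\varepsilon \in P \rightarrow O := \{(\varepsilon, \varepsilon, S)\}$ [] $\varepsilon \notin P \rightarrow O := \emptyset$ fi;
    { invariant: $ur = S \wedge O = (\bigcup x, y, z : xyz = S \wedge xy \le_p u \wedge y \in P : \{(x, y, z)\})$ }
    do $r \ne \varepsilon \rightarrow$
        $u, r := u(r \upharpoonleft 1), r \downharpoonleft 1$; $l, v := u, \varepsilon$; $q := \delta_{R,f}(q_0, l \upharpoonright 1)$;
        as $\varepsilon \in P \rightarrow O := O \cup \{(u, \varepsilon, r)\}$ sa;
        { invariant: $u = lv \wedge v \in f(P) \wedge q = \delta^*_{R,f}(q_0, ((l \upharpoonright 1)v)^R)$ }
        do $l \ne \varepsilon$ cand $q \ne \bot \rightarrow$
            $l, v := l \downharpoonright 1, (l \upharpoonright 1)v$;
            $q := \delta_{R,f}(q, l \upharpoonright 1)$;
            as $v \in P \rightarrow O := O \cup \{(l, v, r)\}$ sa
        od { $l = \varepsilon$ cor $(l \upharpoonright 1)v \notin f(P)$ }
    od { $R$ }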



The particular automaton choices for this detail will be discussed together with
the corresponding choices for detail (gs) in Sections 3 and 4. Note that guard
$v \in P$ can be efficiently computed, i.e. computed in $O(1)$, by providing a map
from automaton states to booleans.

In practice, the algorithms often use automata recognizing $f(P')^R$ where
$P' = \{u : u \in pref(P) \wedge |u| = lmin_P\}$ instead of $f(P)^R$. Informally, an automaton is
built on the prefixes of length $lmin_P$, to obtain smaller automata (see [9] for more
information). To reduce memory usage, the automata are sometimes constructed
on-the-fly as well.

Starting from the above algorithm, we derive an automaton-based skpm
algorithm skeleton. The basic idea is to make shifts of more than one symbol.
Given $k$ satisfying $1 \le k \le (\mathrm{MIN}\, n : 1 \le n \wedge suff(u(r \upharpoonleft n)) \cap P \ne \emptyset : n)$, we can
replace $u, r := u(r \upharpoonleft 1), r \downharpoonleft 1$ by $u, r := u(r \upharpoonleft k), r \downharpoonleft k$. The upper bound on $k$ is the
distance to the next match, the maximal safe shift distance (mssd). Any smaller
$k$ is safe as well, and we thus define a safe shift distance as a shift distance
$k$ satisfying $1 \le k \le (\mathrm{MIN}\, n : 1 \le n \wedge suff(u(r \upharpoonleft n)) \cap P \ne \emptyset : n)$. The use of
assignment $u, r := u(r \upharpoonleft k), r \downharpoonleft k$ for a safe shift distance $k$ forms algorithm detail
(ssd).
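To make the definition concrete, the maximal safe shift distance can be computed by brute force directly from the quantification. The function name `max_safe_shift` is an assumption, and returning `math.inf` when no further match exists follows the convention that a MIN over an empty range yields its unit.

```python
import math


def max_safe_shift(u, r, P):
    """Smallest n >= 1 such that some keyword in P is a suffix of
    u + r[:n]; math.inf when no further match exists."""
    for n in range(1, len(r) + 1):
        w = u + r[:n]
        if any(w.endswith(p) for p in P):
            return n
    return math.inf


P = {"he", "she", "his", "hers"}
print(max_safe_shift("us", "hers", P))    # 2
print(max_safe_shift("ushe", "rs", P))    # 2
print(max_safe_shift("ushers", "", P))    # inf
```

Computing this value is as expensive as matching itself, which is why the shift functions discussed next only approximate it from below.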

Since shift functions may depend on $l$, $v$ and $r$, we will write $k(l, v, r)$. We
aim to approximate the mssd from below, since computing the distance itself
essentially amounts to solving our original problem. To do this, we weaken the
predicate $suff(u(r \upharpoonleft n)) \cap P \ne \emptyset$. This results in safe shift distances that are easier
to compute. In the derivation of such weakening steps, the $u = lv \wedge v \in f(P)$ part
of the invariant of the inner repetition in Algorithm 6 is used. By adding
$l, v := \varepsilon, \varepsilon$ to the initial assignments, we turn this into an outer repetition invariant.
This also turns $l = \varepsilon$ cor $(l \upharpoonright 1)v \notin f(P)$, the negation of the inner repetition
guard, into an outer repetition invariant. Hence, we arrive at the following
algorithm skeleton:
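Algorithm 7(p+, s+, gs, egc, ssd)

    $u, r := \varepsilon, S$;
    if $\varepsilon \in P \rightarrow O := \{(\varepsilon, \varepsilon, S)\}$ [] $\varepsilon \notin P \rightarrow O := \emptyset$ fi;
    $l, v := \varepsilon, \varepsilon$;
    { invariant: $ur = S \wedge O = (\bigcup x, y, z : xyz = S \wedge xy \le_p u \wedge y \in P : \{(x, y, z)\})$
      $\wedge\ u = lv \wedge v \in f(P) \wedge (l = \varepsilon$ cor $(l \upharpoonright 1)v \notin f(P))$ }
    do $r \ne \varepsilon \rightarrow$
        $u, r := u(r \upharpoonleft k(l, v, r)), r \downharpoonleft k(l, v, r)$; $l, v := u, \varepsilon$; $q := \delta_{R,f}(q_0, l \upharpoonright 1)$;
        as $\varepsilon \in P \rightarrow O := O \cup \{(u, \varepsilon, r)\}$ sa;
        { invariant: $q = \delta^*_{R,f}(q_0, ((l \upharpoonright 1)v)^R)$ }
        do $l \ne \varepsilon$ cand $q \ne \bot \rightarrow$
            $l, v := l \downharpoonright 1, (l \upharpoonright 1)v$;
            $q := \delta_{R,f}(q, l \upharpoonright 1)$;
            as $v \in P \rightarrow O := O \cup \{(l, v, r)\}$ sa
        od
    od { $R$ }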



Using this algorithm skeleton, various sublinear algorithms may be obtained by
choosing appropriate f(P) and function k. For lack of space, we do not consider
the choice of suff(P) for f(P) in this paper (see [9] instead).

In [17], an alternative algorithm skeleton for (suffix-based) skpm is presented, 
in which the update to O in the inner loop has been moved out of that loop. 
This requires the use of a precomputed output function, but has the potential 
to substantially reduce the algorithms’ running time. This alternative skeleton 
is not considered in this paper. 



3 Factor-Based Sublinear Pattern Matching 

We now derive a family of algorithms by using the set of factors of P, fact(P).
We use detail choice (gs=f), i.e. we choose fact(P) for f(P). The inner repetition
guard then becomes $l \ne \varepsilon$ cand $(l \upharpoonright 1)v \in fact(P)$.

As direct evaluation of $(l \upharpoonright 1)v \in fact(P)$ is expensive, the transition function
of an automaton recognizing the set $fact(P)^R$ is used (detail choice (egc=RFA)).
Using function $\delta_{R,fact}$ of Section 2 and making $q = \delta^*_{R,fact}(q_0, ((l \upharpoonright 1)v)^R)$ an
invariant of the inner repetition, the guard becomes $l \ne \varepsilon$ cand $q \ne \bot$.

Note that various automata exist whose transition functions can be used for
$\delta_{R,fact}$, including the trie built on $fact(P)^R$ and the suffix automaton or the
dawg (for directed acyclic word graph) on $fact(P)^R$ [3].

The use of detail choices (gs=f) and (egc=RFA) in Algorithm 7 has two
effects. Firstly, more character comparisons will in general be performed: in cases
where $(l \upharpoonright 1)v \notin suff(P)$ yet $(l \upharpoonright 1)v \in fact(P)$, factor-based algorithms will extend
$v$ to the left more than strictly necessary. On the other hand, when the guard
of the inner loop becomes false, $(l \upharpoonright 1)v \notin fact(P)$ holds, which gives potentially
more information to use in the shift function than $(l \upharpoonright 1)v \notin suff(P)$.

Since $(l \upharpoonright 1)v \notin fact(P) \Rightarrow (l \upharpoonright 1)v \notin suff(P)$, we may use any safe shift
function derived for suffix-based sublinear algorithms (discussed in [1, 2, 8]). This
results in a large number of new algorithms, since most such shift functions
have not been used with a factor-based algorithm before. In using such a shift
function, we replace the suff(P) part of their domain by fact(P), meaning that
precomputation changes.

3.1 The No- Factor Shift 

We can get potentially larger shifts than by simply using suffix-based shift
functions, since $(l \upharpoonright 1)v \notin fact(P)$ is stronger than $(l \upharpoonright 1)v \notin suff(P)$. In [9] we show
that we may use any shift function satisfying

    $(\mathrm{MIN}\, n : 1 \le n \wedge Weakening(suff(u(r \upharpoonleft n)) \cap P \ne \emptyset) : n) \max (1 \max (lmin_P - |v|))$    (1)

The left operand of the outer max corresponds to any suffix-based safe shift
function, while the right operand corresponds to the shift in case $(l \upharpoonright 1)v$ is not a
factor of a keyword. We introduce shift function $k_{ssd,nfs}(l, v, r)$ where
$k_{ssd,nfs} \in V^* \times fact(P) \times V^* \rightarrow \mathbb{N}$ is defined by
$k_{ssd,nfs}(l, v, r) = k_{ssd}(l, v, r) \max (1 \max (lmin_P - |v|))$ for any suffix-based safe
shift function $k_{ssd}$ (algorithm detail (nfs)). We call it the no-factor shift, since it
uses $(l \upharpoonright 1)v \notin fact(P)$. In particular, we may use safe shift distance 1 with the
no-factor shift to get shift distance $1 \max (1 \max (lmin_P - |v|)) = 1 \max (lmin_P - |v|)$.

This equals the shift distance used in the basic ideas for backward DAWG 
matching [18, page 27] and - combined with algorithm detail (lmin) mentioned 
in Section 2 - set backward DAWG matching [18, page 68] . The actual algorithms 
described in the literature use an improvement based on a property of DAWGs. 
We discuss this in Subsection 3.2. 
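The basic no-factor shift can be tried out with a small sketch of Algorithm 7 in which fact(P) is held as an explicit set instead of being recognized by an automaton. The names `factors` and `kpm_no_factor_shift`, the index-based bookkeeping, and the example keywords are assumptions; the sketch also assumes all keywords are non-empty.

```python
def factors(P):
    """fact(P): all factors of all keywords, including the empty word."""
    return {p[i:j] for p in P
            for i in range(len(p) + 1) for j in range(i, len(p) + 1)}


def kpm_no_factor_shift(S, P):
    """Algorithm 7 with f(P) = fact(P) and shift 1 max (lmin_P - |v|).

    Assumes a non-empty set P of non-empty keywords.
    u = S[:end], l = S[:start] and v = S[start:end] throughout.
    """
    fact = factors(P)
    lmin = min(len(p) for p in P)
    O = set()
    end, vlen = 0, 0                      # initially u = l = v = empty word
    while end < len(S):
        shift = max(1, lmin - vlen)       # no-factor shift from the last v
        end = min(len(S), end + shift)    # left take never overshoots S
        start = end
        # Scan to the left while the extended word is still a factor.
        while start > 0 and S[start - 1:end] in fact:
            start -= 1
            if S[start:end] in P:
                O.add((S[:start], S[start:end], S[end:]))
        vlen = end - start
    return O


P = {"he", "she", "his", "hers"}
print(sorted(kpm_no_factor_shift("ushers", P)))
# [('u', 'she', 'rs'), ('us', 'he', 'rs'), ('us', 'hers', '')]
```

On a text containing no symbol of any keyword, every scan stops immediately with an empty v, so the window advances by $lmin_P$ symbols at a time; this is where the sublinear number of comparisons comes from.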

The shift function $1 \max (lmin_P - |v|)$ just requires precomputation of
$lmin_P$, yet gives quite large shift distances. This is the reason why factor-based
skpm algorithms have gotten a lot of attention since their first descriptions in 
literature. The algorithms in literature do not combine it with any of the more 
involved precomputed shift functions as those described in [9, 1, 2]. If precompu- 
tation time is not an important issue however, combining such a shift function 
and the no-factor shift may be advantageous, potentially yielding larger shifts. 
As far as we know, such a combination has not been described or used before. 
Since about ten shift functions are given in [9], the combination of a single one 
with the no-factor shift already gives us about ten new factor-based skpm algo- 
rithms. It remains to be investigated whether these algorithms indeed improve 
over the running time of the algorithms from literature. 

3.2 Cheap Computation of a Particular Shift Function 

We now consider a different weakening of $suff(u(r \upharpoonleft n)) \cap P \ne \emptyset$ in the safe shift
function predicate. In [9] we show that this leads to shift $1 \max (lmin_P - last_{v,P})$
where $last_{v,P} = (\mathrm{MAX}\, m : 0 \le m \le |v| \wedge v \upharpoonright m \in pref(P) : m)$.

It seems to be rather difficult to compute $last_{v,P}$. When using a DAWG to
implement transition function $\delta_{R,fact}$ of algorithm detail (egc=RFA) however,
we may use a property of this automaton to compute $last_{v,P}$ 'on the fly': the
final states of the DAWG correspond to suffixes of some $p^R \in P^R$, i.e. to prefixes
of some $p \in P$. Thus, $last_{v,P}$ equals the length of $v$ at the moment the most
recent final state was visited.

We introduce shift function $k_{lskp} \in fact(P) \rightarrow \mathbb{N}$ defined by
$k_{lskp}(v) = 1 \max (lmin_P - last_{v,P})$. This function does not depend on $l$ and can
therefore be seen as a variant of the shift function represented by algorithm detail
(nla) in [9, 1]. Calculating the shift distance using $k_{lskp}$ (and variable $last_{v,P}$) is
algorithm detail (lskp) (longest suffix which is keyword prefix). Due to lack of space,
the complete algorithm is not presented here; please refer to [9, 8] for this.
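The quantity behind this shift can be computed directly from its definition, without a DAWG, which is useful for checking an on-the-fly implementation on small inputs. The sketch below reads the definition as ranging over the rightmost m symbols of v, in line with the name 'longest suffix which is keyword prefix'; that reading, the function names, and the example keywords are assumptions.

```python
def last_keyword_prefix(v, P):
    """Largest m <= len(v) such that the m rightmost symbols of v form
    a prefix of some keyword in P (m = 0 always qualifies)."""
    prefixes = {p[:i] for p in P for i in range(len(p) + 1)}
    return max(m for m in range(len(v) + 1) if v[len(v) - m:] in prefixes)


def lskp_shift(v, P):
    """Shift distance 1 max (lmin_P - last), by direct evaluation."""
    lmin = min(len(p) for p in P)
    return max(1, lmin - last_keyword_prefix(v, P))


P = {"he", "she", "his", "hers"}
print(last_keyword_prefix("sh", P), lskp_shift("sh", P))  # 2 1
print(last_keyword_prefix("er", P), lskp_shift("er", P))  # 0 2
print(last_keyword_prefix("rs", P), lskp_shift("rs", P))  # 1 1
```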

The algorithm is a variant of Set Backward DAWG Matching [5], [18, page
68], which adds algorithm detail (lmin). Adding algorithm detail (okw) results
in (single-keyword) Backward DAWG Matching.

Algorithm detail (nfs) is included in neither detail sequence, since the
no-factor shift can never be larger than the $k_{lskp}$ shift.

4 Factor Oracle-Based Sublinear Pattern Matching 

We now derive a family of algorithms by using $factoracle(P^R)^R$ for f(P).
We may do so since $factoracle(P^R)^R \supseteq fact(P^R)^R$ and factoracle is suffix-
closed [19, 6, 7]. We strengthen the inner repetition guard, which now becomes
$l \ne \varepsilon$ cand $(l \upharpoonright 1)v \in factoracle(P^R)^R$.

Since direct evaluation of $(l \upharpoonright 1)v \in factoracle(P^R)^R$ is impossible$^3$, the
transition function of the factor oracle [19, 6, 7] recognizing the set $factoracle(P^R)$ is
used. Using function $\delta_{factoracle(P^R)}$ $^4$ and making
$q = \delta^*_{factoracle(P^R)}(q_0, ((l \upharpoonright 1)v)^R)$ an invariant of the inner repetition, the guard
becomes $l \ne \varepsilon$ cand $q \ne \bot$.

The use of detail choices (gs=fo) and (egc=RFO) in Algorithm 7 has two
important effects. Firstly, the factor oracle recognizing $factoracle(P^R)$ is easier
to construct and may have fewer states and transitions than an automaton
recognizing $fact(P^R)$ [19, 6, 7]. On the other hand, even more character comparisons
may be performed than when using an automaton recognizing $fact(P^R)$ (let
alone when using an automaton recognizing $suff(P^R)$): when $(l \upharpoonright 1)v \notin fact(P)$
yet $(l \upharpoonright 1)v \in factoracle(P^R)^R$, the algorithm will go on extending $v$ to the left
more than strictly necessary.

However, $(l \upharpoonright 1)v \notin factoracle(P^R)^R \Rightarrow (l \upharpoonright 1)v \notin fact(P^R)^R$ and hence
$(l \upharpoonright 1)v \notin fact(P)$, and therefore any shift function satisfying Equation 1 may be
used. In particular, both the safe shift functions for the suffix-based algorithms
and the no-factor shift introduced in Section 3 may be used.

$^3$ The language of a factor oracle so far has not been described separately from the
automaton construction.

$^4$ Since $factoracle(P)^R \ne factoracle(P^R)$ could hold, we cannot use $\delta_{R,factoracle}$ to
describe the transition function. We introduce $\delta_{factoracle(P^R)}$, the transition
function of the automaton recognizing $factoracle(P^R)$.
The Set Backward Oracle Matching algorithm [7], [18, pages 69-72] equals
our algorithm (p+, s+, gs=fo, egc=RFO, lmin, ssd, nfs, one), while adding
detail (okw) gives the single keyword Backward Oracle Matching algorithm [6],
[18, pages 34-36], [19].

5 Final Remarks 

We showed how suffix, factor and factor oracle automaton-based sublinear key- 
word pattern matching algorithms appearing in the literature can be seen as 
instantiations of a general automaton-based skpm algorithm skeleton. The algo- 
rithms were formally derived as part of a new taxonomy, presenting correctness 
arguments and clarity on their working and interrelationships. 

We discussed the algorithm skeleton and some instantiations leading to well- 
known algorithms such as (Set) Backward DAWG Matching and (Set) Backward 
Oracle Matching. In addition, we showed the shift functions used for suffix-based 
algorithms to be in principle reusable for factor- and factor oracle-based algo- 
rithms. This results in a number of previously undescribed factor automaton- 
based skpm algorithm variants. Their practical performance remains to be in- 
vestigated and compared to the more basic factor-based skpm algorithms known 
from the literature and described in this paper as well. 

The algorithms described here could also be described using a generalization
of the alternative Commentz-Walter algorithm skeleton presented in [17], in
which the output variable update is moved out of a loop to increase performance.
In addition to changes to the algorithms in the taxonomy to accommodate this
idea, benchmarking would also need to be performed to study the effects.

We have not considered precomputation of the various shift functions used 
in the algorithms discussed in this paper. Precomputation of these functions for 
suffix-based algorithms was described in [1], but extending this precomputation 
to factor- and factor oracle-based algorithms remains to be done. 

A Quantifications

A basic understanding of the meaning of quantifications is assumed. We use
the notation $(\oplus a : R(a) : E(a))$ where $\oplus$ is the associative and commutative
quantification operator (with unit $e_\oplus$), $a$ is the quantified variable introduced,
$R$ is the range predicate on $a$, and $E$ is the quantified expression. By definition,
we have $(\oplus a : \mathit{false} : E(a)) = e_\oplus$.
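
As an informal illustration of this notation, a quantification over a finite domain can be read as a fold that starts from the unit of the operator. The helper name `quantify` and the sample data are assumptions made for this sketch only.

```python
from functools import reduce


def quantify(op, unit, domain, in_range, expr):
    """Evaluate (op a : in_range(a) : expr(a)) over a finite domain.

    op must be associative and commutative with the given unit, so the
    result for an empty range is the unit itself.
    """
    return reduce(op, (expr(a) for a in domain if in_range(a)), unit)


numbers = [3, 8, 5]
# (MIN a : a is even : a) has exactly one value in range.
smallest_even = quantify(min, float("inf"), numbers, lambda a: a % 2 == 0, lambda a: a)
# (MIN a : false : a) has an empty range and yields the unit +infinity.
empty_min = quantify(min, float("inf"), numbers, lambda a: False, lambda a: a)
# (EXISTS a : false : a > 0) yields the unit false.
empty_exists = quantify(lambda x, y: x or y, False, numbers, lambda a: False, lambda a: a > 0)
```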

The following table lists some of the most commonly quantified operators, 
their quantified symbols, and their units: 



Operator   ∨       ∧      ∪     min    max    +
Symbol     ∃       ∀      ⋃     MIN    MAX    Σ
Unit       false   true   ∅     +∞     −∞     0
Abstract. Query expansion is a well-known method for improving average
effectiveness in information retrieval. However, the most effective
query expansion methods rely on costly retrieval and processing of feedback
documents. We explore alternative methods for reducing query-evaluation
costs, and propose a new method based on keeping a brief
summary of each document in memory. This method allows query expansion
to proceed three times faster than previously, while approximating
the effectiveness of standard expansion.



1 Introduction 

Standard ranking techniques in information retrieval return documents that con- 
tain the same terms as the query. While the insistence on exact vocabulary 
matching is often effective, identification of some relevant documents involves 
finding alternative query terms. Previous work has shown that through query 
expansion (QE) effectiveness is often significantly improved (Rocchio, 1971, 
Robertson and Walker, 1999, Carpineto et al., 2001). 

Local analysis has been found to be one of the most effective methods for 
expanding queries (Xu and Croft, 2000). For those methods the original query is 
used to determine top-ranked documents from which expansion terms are sub- 
sequently extracted. A major drawback of such methods is the need to retrieve 
those documents during query evaluation, greatly increasing costs. In other work
(Billerbeck et al., 2003), we explored the use of surrogates built from past queries
as a cheap source of expansion terms, but such surrogates require large query
logs to be usable.

In this paper, we identify the factors that contribute to the cost of query ex- 
pansion, and explore in principle the alternatives for reducing these costs. Many 
of these approaches compromise effectiveness so severely that they are not of 
practical benefit. However, one approach is consistently effective: use of brief 
summaries - a pool of the most important terms - of each document. These 
surrogates are much smaller than the source documents, and can be rapidly pro- 
cessed during expansion. In experiments with several test sets, we show that our 
approach reduces the time needed to expand and evaluate a query by a factor of 
three, while approximately maintaining effectiveness compared to standard QE. 



A. Apostolico and M. Melucci (Eds.): SPIRE 2004, LNCS 3246, pp. 30-42, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004
2 Background



Relevance feedback is used to refine a query using knowledge of whether docu- 
ments retrieved by this query are relevant. Weighted terms from judged docu- 
ments are added to the original query, where they act as positive and negative 
examples of the terms that should occur in relevant and non-relevant documents. 
The modified query is then reissued, in the hope of ranking the remaining
relevant documents more highly (Rocchio, 1971, Ruthven and Lalmas, 2003).
Interactive QE can significantly increase effectiveness (Magennis and van Rijsbergen,
1997), although on average - for non-expert users - automatic expansion is more
likely to lead to better performance (Ruthven, 2003).

In automatic QE, also called pseudo relevance feedback, the query is aug- 
mented with expansion terms from highly-ranked documents (Robertson and 
Walker, 1999). An alternative (Qiu and Frei, 1993, Gauch and Wang, 1997) is to 
examine the document collection ahead of time and construct similarity thesauri 
to be accessed at query time. The use of thesauri in general has been shown to
be less successful than automatic QE (Mandala et al., 1999), though the two
approaches can be successfully combined (Xu and Croft, 2000).

An effective method for QE, used throughout this paper, is based on the
Okapi BM25 measure (Robertson and Walker, 1999, Robertson et al., 1992).
Slightly modified, this measure is as follows:

$$\mathrm{bm25}(q,d) = \sum_{t \in q} \log\left(\frac{N - f_t + 0.5}{f_t + 0.5}\right) \times \frac{(k_1 + 1) f_{d,t}}{K + f_{d,t}}$$

where terms $t$ appear in query $q$; the collection contains $N$ documents $d$; $f_t$
documents contain a particular term and a particular document contains a particular
term $f_{d,t}$ times; $K$ is $k_1((1 - b) + b \times L_d / AL)$; constants $k_1$ and $b$ respectively are
set to 1.2 and 0.75; and $L_d$ and $AL$ are measurements in a suitable unit for the
document length and average document length respectively. The modification
to the original formulation (see Sparck-Jones et al. (2000) for a detailed
explanation) is the omission of a component that deals with repeated query terms. In
the queries we use, term repetitions are rare.
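
The measure can be sketched directly from these definitions. The function and parameter names below are assumptions made for illustration; they are not taken from any particular search engine.

```python
import math


def bm25(query_terms, doc_term_freqs, doc_len, avg_doc_len, num_docs, doc_freqs,
         k1=1.2, b=0.75):
    """Score one document against a query with the modified BM25 measure.

    doc_term_freqs maps term -> f_{d,t}; doc_freqs maps term -> f_t.
    """
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for t in query_terms:
        f_dt = doc_term_freqs.get(t, 0)
        if f_dt == 0:
            continue  # a term absent from the document contributes nothing
        f_t = doc_freqs[t]
        idf = math.log((num_docs - f_t + 0.5) / (f_t + 0.5))
        score += idf * (k1 + 1) * f_dt / (K + f_dt)
    return score


# A document of average length containing the term "oracle" twice,
# in a collection of 1000 documents of which 10 contain the term.
example_score = bm25(["oracle", "missing"], {"oracle": 2}, 100, 100, 1000,
                     {"oracle": 10, "missing": 5})
```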

In this paper we use the expansion method proposed by Robertson and 
Walker (1999) where E terms with the lowest term selection value are chosen 
from the top R ranked documents: 



$$\mathrm{TSV}_t = \left(\frac{f_t}{N}\right)^{r_t} \binom{R}{r_t}$$


where a term $t$ is contained in $r_t$ of the top ranked $R$ documents. The expansion
terms get added to the original query, but instead of using their Okapi value,
their weight (Robertson and Walker, 1999) is chosen by the formula¹:

$$\frac{1}{3} \times \log\left(\frac{(r_t + 0.5)/(R - r_t + 0.5)}{(f_t - r_t + 0.5)/(N - f_t - R + r_t + 0.5)}\right)$$

¹ The factor of 1/3 was recommended by unpublished correspondence with the authors.
It de-emphasises expansion terms and prevents query drift, that is, “alteration of
the focus of a search topic caused by improper expansion” (Mitra et al., 1998). We
confirmed in unpublished experiments that the value of the factor is suitable.



We have shown previously that best choices of R and E depend on the collection 
used and should in principle be carefully optimised (Billerbeck and Zobel, 2004); 
to reduce the complexity of the experiments, in this paper we use the standard 
values of R = 10 and E = 25.
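
Term selection and weighting can be sketched as follows. The sketch assumes that the Robertson and Walker term selection value has the form $(f_t/N)^{r_t} \binom{R}{r_t}$ and that the down-weighting factor is 1/3; both are stated here as assumptions, and the function names and sample counts are hypothetical.

```python
import math


def term_selection_value(f_t, r_t, num_docs, top_r):
    """TSV_t = (f_t / N)^{r_t} * C(R, r_t); lower values indicate better candidates."""
    return (f_t / num_docs) ** r_t * math.comb(top_r, r_t)


def expansion_weight(f_t, r_t, num_docs, top_r, factor=1 / 3):
    """Down-weighted relevance weight given to a selected expansion term."""
    odds_in_top = (r_t + 0.5) / (top_r - r_t + 0.5)
    odds_elsewhere = (f_t - r_t + 0.5) / (num_docs - f_t - top_r + r_t + 0.5)
    return factor * math.log(odds_in_top / odds_elsewhere)


def select_expansion_terms(candidates, num_docs, top_r=10, num_terms=25):
    """candidates maps term -> (f_t, r_t); return the num_terms lowest-TSV terms."""
    ranked = sorted(candidates,
                    key=lambda t: term_selection_value(*candidates[t], num_docs, top_r))
    return ranked[:num_terms]


# Hypothetical candidates: (documents containing the term, of which in the top R).
candidates = {"rare": (50, 6), "common": (40000, 9), "once": (50, 1)}
chosen = select_expansion_terms(candidates, num_docs=100000, top_r=10, num_terms=2)
rare_weight = expansion_weight(50, 6, 100000, 10)
```

A term that is rare in the collection but frequent among the top R documents receives the lowest selection value and is therefore chosen first.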

Although there has been a great deal of research on efficient evaluation of 
ranked queries (Witten et al., 1999, pages 207-210), there is no prior work on
efficient QE for text retrieval, the focus of this paper. 



3 Query Expansion Practicalities 

In most expansion methods making use of local analysis, there are five key stages. 
First, the original query is used to rank an initial set of documents. This set is 
then retrieved from disk and all terms are extracted from those documents. 
Terms are evaluated and ranked in order of their potential contribution to the 
query. The top ranked terms are appended to the query, and finally the refor- 
mulated query is reissued and a final set of documents is ranked. 

Each phase of the ranking process has scope for efficiency gains, but some 
of the gains involve heuristics that can compromise effectiveness. In this section 
we explore these options; this exploration provides a focus for the experiments 
reported later in this paper. Some of the concepts introduced here - in particular, 
associations and surrogates - are described in more detail in the next section. 

Initial Ranking. During the first stage, documents are ranked according to the 
original query. For each query term the inverted list is retrieved, if it hasn’t 
been cached, and processed. For each document referenced in the list, a score is 
calculated and added to a list of scores that is kept for (say) 20,000 documents 
(Moffat and Zobel, 1996). Once all query terms have been processed, the top R 
documents are used for the next stage. 

The cost of accessing an inverted list depends on the disk access time. For a 
long list, the costs are directly proportional to list size. If the list is organised 
by document identifier, the whole list must be fetched for each query term. 

A way of reducing the cost of retrieving and processing the inverted lists is 
to cut down the volume of list information that has to be retrieved. This has 
been achieved by, for example, Anh and Moffat (2002), where documents are 
not stored in the order they are encountered during indexing, but in order of 
the impact a term has in a particular document. For instance, a term has more 
impact in a document in which it occurs twice, than another of the same length 
in which it occurs once. Using this ordering means that either the processing 
of lists can be stopped once a threshold is reached, or that the lists are capped 
to begin with, leading to lower storage requirements, reduced seek times, and 
allowing more lists to be cached in memory. We have not used impacts in our 
experiments, but the gains that they provide are likely to be in addition to the 
gains that we achieve with our methods. 

Another way to reduce list length, discussed in more detail later, is to index
only a fraction of the document collection for the initial ranking. Initial ranking
is traditionally on the document collection, but there is no particular reason why
other collections should not be used. Another option of this kind, also explored
later, is to use document surrogates. A drawback of these approaches is that the
full index still needs to be available for the final ranking and thus is loaded at
the same time as auxiliary indexes. This means that some of the advantage of
using shorter lists is negated by having less space available to cache them.



Fetching Documents. Having identified the highly ranked documents, these need 
to be fetched. In the vast majority of cases these documents are not cached from 
a previous expansion or retrieval process (assuming a typical memory size), and 
therefore have to be fetched from disk, at a delay of a few milliseconds each. 

Traditionally, full-text documents are fetched. This is the most expensive 
stage of expansion and therefore the area where the greatest gains are available. 
We have shown previously that surrogates - which are a fraction of the size of the
documents - can be more effective than full-text documents (Billerbeck et al.,
2003). Using surrogates such as query associations is more efficient, provided
that those surrogates can be pre-computed, as discussed later.

Another approach is limiting the number of documents available for extrac- 
tion of terms, which should result in higher efficiency, due to reduced cache misses 
when retrieving the remaining documents and otherwise smaller seek times as 
it can be expected that the limited number of documents are clustered on disk. 
Documents could be chosen by, for example, discarding those that are the least 
often accessed over a large number of queries (Garcia et al., 2004).

A more radical measure is to use in-memory document surrogates that pro- 
vide a sufficiently large pool of expansion terms, as described in the following 
section. If such a collection can be made sufficiently small, the total cost of ex- 
pansion can be greatly reduced. Typically full text document collections don’t fit 
into main memory, but well-constructed surrogates may be only a small fraction 
of the size of the original collection. Our surrogates are designed to be as small 
as possible while maintaining effectiveness. 



Extracting Candidate Terms. Next, candidate terms (that is, potential expan- 
sion terms) are extracted from the fetched documents. These documents need to 
be parsed, and terms need to be stopped. (We do not use stemming, since in un- 
published experiments we have found that stemming does not make a significant 
difference to effectiveness.) 

This phase largely depends on the previous phase; if full text documents 
have been fetched, these need to be parsed and terms need to be stopped. In the 
case of query associations, the surrogates are pre-parsed and pre-stopped and 
extraction is therefore much more efficient. 

The in-memory surrogates we propose can be based on pointers rather than 
the full terms in memory. The pointers reference terms in the dictionary used 
for finding and identifying statistics and inverted lists. They have a constant size 
(4 bytes) and are typically smaller than a vocabulary term. This approach also
eliminates the lookups needed in the next stage. 

Selecting Expansion Terms. The information (such as the inverse document 
frequency) necessary for calculation of a term’s TSV is held in the vocabulary, 
which may be held on disk or (as in our implementation) in memory; even when 
held on disk, the frequency of access to the vocabulary means that typically 
much of it is cached. As a result, this phase is the fastest and can only be sped 
up by providing fewer candidate terms for selection. 

Query associations typically consist of 20-50 terms, as opposed to the average 
of 200 or more for web documents. Use of surrogates could make this stage several 
times more efficient than the standard approach. Surrogates are a strict subset 
of full text documents, and usually are a tiny fraction thereof, ensuring that 
selection is efficient. 

Final Ranking. Finally the document collection is ranked against the reformu- 
lated query. Similar considerations as in the first phase are applicable here. We 
have shown previously (Billerbeck et al., 2003) that final ranking against surro- 
gates is, unsurprisingly, ineffective. The only option for efficiency gains at this 
stage is to use an approach such as impact-ordering, as discussed earlier. 

4 Methods of Increasing Efficiency for QE 

In the previous section we identified costs and plausible approaches for reducing 
them. In this section, we consider the most promising methods in more detail, 
setting a framework for experiments. In particular, we propose the novel strategy 
of using bag-of-word summaries as a source of expansion terms. 

Query Associations. Query associations (Scholer and Williams, 2002) capture 
the topic of a document by associating past user queries with the documents that 
have been highly ranked by that query. We have previously shown (Billerbeck 
et al., 2003) that associations are effective when useful query logs are available. 
A disadvantage of using associations is that an extra index needs to be loaded 
and referenced during query evaluation. However, this penalty is small, as asso- 
ciations are likely to be a small fraction of collection size. The advantages are 
that associations are usually pre-stemmed and stopped, stored in a parsed form, 
and cheap to retrieve. 

Rather than indexing the associations, it would be possible in principle to 
rank using the standard index, then fetch and expand from the associations, but 
in our earlier work (Billerbeck et al., 2003) we found that it was necessary to 
rank against the associations themselves. 

Reducing Collection Size for Sourcing Expansion Terms. The intuition under- 
lying expansion is that, in a large collection, there should be multiple documents 
on the same topic as the query, and that these should have other pertinent terms. 
However, there is no logical reason why the whole collection should have to be
accessed to identify such documents. Plausibly, documents sampled at random 
from the collection should represent the overall collection in respect of the ter- 
minology used. In our experiments, we sampled the collection by choosing every 
nth document, for n of 2 and 4. Other options would be to use centroid clusters 
or other forms of representative chosen on the basis of semantics. Documents 
could also be stored in a pre-parsed format (such as a forward index), which we 
have not tested. 

In-Memory Document Summaries. The major bottleneck of local analysis is 
the reliance on the highly ranked documents for useful expansion terms. These 
documents typically need to be retrieved from disk. We propose that summaries 
of all documents be kept in memory, or in a small auxiliary database that is 
likely to remain cached. A wide range of document summarisation techniques 
have been investigated (Goldstein et al., 1999), and in particular Lam-Adesina
and Jones (2001) have used summarisation for QE. In this work, representative 
sentences are selected, giving an abbreviated human-readable document. 

However, summaries to be used for QE are not for human consumption. We 
propose instead that the summaries consist of the terms with the highest tf.idf 
values, that is, the terms that the expansion process should rank highest as 
candidates if given the whole document. To choose terms, we use the function: 

$$\mathrm{tf.idf} = \log\left(\frac{N}{f_t}\right) \times \log(1 + f_{d,t})$$

where $N$ is the number of documents in the collection, $f_t$ of which contain term $t$,
and $f_{d,t}$ is the number of occurrences of $t$ in document $d$.

Given these values, we can then build summaries in two ways. One is to 
have a fixed number S of highly-ranked terms per document. The other is to 
choose a global threshold C, in which case each summary consists of all the 
document terms whose tf.idf value exceeds C. Instead of representing summaries 
as sequences of terms, it is straightforward to instead use lists of pointers to the 
vocabulary representation of the term, reducing storage costs and providing rapid 
access to any statistics needed for the TSV. During querying, all terms in the 
surrogates that have been ranked against the original query are then used for 
selection. This not only avoids long disk I/Os, but also means that the original
documents - typically stored only in their raw form - do not need to be parsed.
S or C can be chosen depending on collection size or available memory.
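
The two ways of building summaries can be sketched as below, assuming the selection function is $\log(N/f_t) \times \log(1 + f_{d,t})$. Documents are taken to be already parsed and stopped lists of tokens, and the vocabulary is a plain dictionary from term to integer identifier; these choices, like the function names, are assumptions for illustration rather than a description of the implementation used in the experiments.

```python
import math
from array import array
from collections import Counter


def tf_idf(num_docs, f_t, f_dt):
    """Selection value of a term in a document: log(N / f_t) * log(1 + f_dt)."""
    return math.log(num_docs / f_t) * math.log(1 + f_dt)


def build_summaries(docs, size=None, cutoff=None):
    """Build one summary per document as an array of vocabulary term ids.

    docs is a list of token lists. Pass size (S) for fixed-length summaries
    or cutoff (C) for a global tf.idf threshold; exactly one must be given.
    """
    vocab = {}                                   # term -> integer id
    doc_freq = Counter()                         # term -> f_t
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] += 1
            vocab.setdefault(term, len(vocab))
    summaries = []
    for tokens in docs:
        counts = Counter(tokens)
        scored = sorted(((tf_idf(len(docs), doc_freq[t], f), t) for t, f in counts.items()),
                        reverse=True)
        if size is not None:
            chosen = [t for _, t in scored[:size]]
        else:
            chosen = [t for value, t in scored if value > cutoff]
        # Unsigned integer ids stand in for pointers into the vocabulary.
        summaries.append(array("I", (vocab[t] for t in chosen)))
    return vocab, summaries


docs = [["query", "expansion", "query", "cost"],
        ["query", "index", "disk"],
        ["disk", "cost", "cache", "cache"]]
vocab, fixed = build_summaries(docs, size=1)
_, thresholded = build_summaries(docs, cutoff=0.5)
id_to_term = {i: t for t, i in vocab.items()}
top_terms = [id_to_term[s[0]] for s in fixed]
```

On common platforms an `array("I")` element occupies 4 bytes, which matches the pointer size mentioned earlier, although the exact width is platform-dependent.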

Although it is likely that query-biased summaries (Tombros and Sanderson, 
1998) - as provided in most contemporary web search engines - would be more 
effective (Lam-Adesina and Jones, 2001), such a method cannot be applied in 
the context of efficient QE, as query-biased summaries cannot be precomputed. 

Other Approaches. Since the original query terms effectively get processed twice 
during the ranking process, it seems logical to only process the original query 
terms during the initial ranking, and then, later, process the expansion terms 
without clearing the accumulator table that was used for the initial ranking. 
However, as explored previously (Moffat and Zobel, 1996), limiting the number
of accumulators aids efficiency and effectiveness. To support this strategy,
query terms must be sorted by their inverse document frequency before the query
is processed. Because most expansion terms have a high inverse document
frequency - that is, they appear in few documents and are relatively rare - it is
important that they be processed before most of the original query terms, which
typically have lower values. (The effect is similar - albeit weaker - to that of
impact ordered indexes as discussed previously.) This means that the original query
must be processed again with the expansion terms for final ranking. Intuition
suggests that this argument is incorrect, and the original query terms should be
allowed to choose the documents; however, in preliminary experiments we found
that it was essential to process the original terms a second time. Processing only
expansion terms in the second phase reduced costs, but led to poor effectiveness.
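
The interaction between a bounded accumulator table and term ordering can be sketched as follows. This is a simplified variant in which, once the table is full, postings for new documents are skipped while existing accumulators continue to be updated; the data layout and names are assumptions for illustration.

```python
import math


def rank_with_accumulators(query_terms, inverted_lists, num_docs, max_accumulators):
    """Term-at-a-time ranking with a bounded accumulator table.

    inverted_lists maps term -> list of (doc_id, partial_score).
    Terms are processed in order of decreasing inverse document frequency,
    so rare terms decide which documents receive an accumulator.
    """
    def idf(term):
        return math.log(num_docs / len(inverted_lists[term]))

    accumulators = {}
    for term in sorted(query_terms, key=idf, reverse=True):
        for doc_id, partial in inverted_lists[term]:
            if doc_id in accumulators:
                accumulators[doc_id] += partial
            elif len(accumulators) < max_accumulators:
                accumulators[doc_id] = partial     # room left: admit a new document
            # otherwise the table is full and the posting is skipped
    return sorted(accumulators.items(), key=lambda item: item[1], reverse=True)


lists = {"rare": [(7, 2.0)], "common": [(1, 0.5), (2, 0.5), (7, 0.5)]}
ranking = rank_with_accumulators(["common", "rare"], lists, num_docs=10, max_accumulators=2)
```

Because the rare term is processed first, document 7 is guaranteed an accumulator; had the common term been processed first with the same limit, documents 1 and 2 would have filled the table and document 7 would have been lost.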

Other strategies could also lead to reduced costs. Only some documents,
perhaps chosen by frequency of access (Garcia et al., 2004) or sampling, might be
included in the set of surrogates. A second tier of surrogates could be stored on
disk, for retrieval in cases where the highly-ranked documents are not amongst
those selected by sampling. Any strategy could be further improved by
compressing the in-memory surrogates, for example with d-gapping (Witten et al.,
1999, page 115) and a variable-byte compression scheme (Scholer et al., 2002).
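
A minimal sketch of d-gapping followed by variable-byte coding is given below for a summary stored as an ascending list of term identifiers. The byte layout (seven data bits per byte, low-order bits first, high bit set on the final byte of each number) is one common convention and is an assumption here, not necessarily the layout of the cited scheme.

```python
def to_gaps(sorted_ids):
    """Replace each id in an ascending list by its difference from the previous one."""
    return [cur - prev for prev, cur in zip([0] + sorted_ids[:-1], sorted_ids)]


def from_gaps(gaps):
    """Invert to_gaps by accumulating the differences."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                       # terminating byte of this number
            numbers.append(value | ((byte & 0x7F) << shift))
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


term_ids = [3, 7, 300]
encoded = vbyte_encode(to_gaps(term_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Here three identifiers that would occupy 12 bytes as 4-byte pointers are stored in 4 bytes, since the gaps 3, 4 and 293 need one, one and two bytes respectively.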

Note that our summaries have no contextual or structural information, and 
therefore cannot be used - without major modifications - in conjunction with 
methods using such information, such as the local context analysis method of 
Xu and Croft (2000) or the summarisation method of Goldstein et al. (1999). 

5 Experiments 

Evaluating these approaches to QE requires that we test whether the heuristics 
degrade effectiveness, and whether they lead to reduced query evaluation time. 
To ensure that the time measurements were realistic, we used Lucy^ as the 
underlying search engine. 

The test data is drawn from the TREC conferences (Harman, 1995). We used 
two collections. The first was of newswire data, from TREC 7 and 8. The second 
was the WT10g collection, consisting of 10 gigabytes of web data crawled in
1997 (Bailey et al., 2003) for TREC 9 and 10. Each of these collections has two
sets of 50 topics and accompanying relevance judgements. As queries, we used 
the title field from each TREC topic. We use the Wilcoxon signed rank test to 
evaluate the significance of the effectiveness results (Zobel, 1998). 

For timings, we used 10,000 stopped queries taken from two query logs col- 
lected for the Excite search engine (Spink et al., 2002); these are web queries 
and thus are suitable for the WT10g runs. Since we were not able to obtain
appropriate query logs for the newswire data, we used the same 10,000 queries 

^ Lucy/Zettair is an open source search engine being developed at RMIT by the Search 
Engine Group. The primary aim in developing Lucy is to test techniques for efficient 
information retrieval. Lucy is available from http://www.seg.rmit.edu.au/. 
Table 1. Performance of expansion techniques of TREC queries on the TREC newswire
and WT10g collections, for TREC 8 and TREC 10 queries. Effectiveness results shown
are average precision (AvP), precision at 10 (P@10), and R-Precision (R-P). Also shown
is the average query time over 10,000 queries and the amount of overhead memory
required for each method; “index” marks the need to refer to an auxiliary index during
expansion. A † marks results that are significantly different to the baseline of no
expansion at the 0.10 level, and ‡ at the level of 0.05. S is the number of summary terms
used, and C specifies the cutoff threshold for the selection value.

TREC  Expansion  Time  AvP     P@10    R-P     Mem
      Method     (ms)                          (MB)

[Rows for TREC 8 preceding S = 1 are incomplete; surviving cells: None, n/a;
0.288†; index; 0.275†, index; 8, Quarter, 167.]

8     S = 1        46  0.231†  0.446   0.267‡  6
8     S = 10       54  0.238†  0.438   0.271†  24
8     S = 25       59  0.244†  0.456   0.277†  54
8     S = 40       61  0.245†  0.452   0.275†  83
8     S = 50       64  0.243†  0.454   0.281†  102
8     S = 100      72  0.240†  0.450   0.282‡  183
8     C = 1.0      58  0.243†  0.448   0.280†  56

10    None         62  0.163   0.290   0.190   n/a
10    Standard    615  0.180   0.288   0.202   n/a
10    Assoc.      835  0.180   0.272†  0.209   index
10    S = 1       139  0.138   0.218   0.150   19
10    S = 10      177  0.153†  0.227   0.169   76
10    S = 25      202  0.156   0.224   0.170   166
10    S = 28      204  0.185   0.308   0.217†  183
10    S = 50      221  0.156   0.224   0.170   296
10    S = 100     245  0.156   0.224   0.170   296
10    C = 1.0     217  0.185†  0.312†  0.213†  190


for this collection. The machine used for our timings is a dual Intel Pentium III 
866 MHz with 768 MB of main memory running Fedora Core 1.



Results 

We used the TREC 8 and TREC 10 query sets to explore the methods. Results 
for this exploration are shown in Table 1. We applied the best methods found in 
Table 1 to the TREC 7 and TREC 9 query sets, as shown in Table 2. The tables 
detail the collection, the method of expansion, average precision, precision at 
10, and r-precision values, as well as auxiliary memory required. A second index 
Table 2. As in Table 1, but showing results only for the methods that worked best on
TREC 8 and TREC 10.

TREC  Expansion  Time  AvP     P@10    R-P     Mem
      Method     (ms)                          (MB)
7     None         23  0.191   0.456   0.248   n/a
7     Standard    211  0.232†  0.452   0.286†  n/a
7     S = 40       61  0.220†  0.426†  0.279†  83
7     C = 1.0      58  0.215†  0.426†  0.272‡  56
9     None         62  0.193   0.267   0.223   n/a
9     Standard    615  0.177   0.260   0.200   n/a
9     S = 28      204  0.161   0.269   0.176   183
9     C = 1.0     217  0.162   0.256   0.169†  190


is needed for the runs where associations or fractional collections are used for 
initial ranking and candidate term extraction. 
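
For reference, the three effectiveness measures reported in the tables can be computed from a ranking and a set of relevance judgements as in the sketch below. The function names and the toy ranking are assumptions; the definitions are the standard ones.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranking, relevant, len(relevant))


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
p_at_2 = precision_at(ranking, relevant, 2)
r_p = r_precision(ranking, relevant)
avp = average_precision(ranking, relevant)
```

The values in the tables are averages of these per-query figures over the 50 topics of each query set.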

For TREC 8 and to a lesser extent TREC 10, standard QE improves over 
the baseline, but in both cases query evaluation takes around nine times as long. 
Several of the methods proposed do not succeed in our aims. Associations take as 
long as standard QE, and effectiveness is reduced. For TREC 8 the surrogates 
are arguably inappropriate, as the web queries may not be pertinent to the 
newswire data; however, this issue highlights the fact that without a query log 
associations cannot be used. 

Using halves (n = 2) or quarters (n = 4) of the collection also reduces 
effectiveness, and has little impact on expansion time; this is due to the need to 
load and access a second index. Larger n led to smaller improvements in QE; in 
experiments with n = 8, not reported here, QE gave no improvements. Reducing 
R to roughly a quarter of its original size in order to cater for a smaller number of 
relevant documents - as intuition might suggest - only further degrades results. 
This is consistent with previous work which shows that retrieval effectiveness,
especially in the top ranked documents, is greater for larger collections than for
sub-collections (Hawking and Robertson, 2003), which means that there is a higher
likelihood of sourcing expansion terms from relevant documents when using local
analysis QE. It was also found that QE works best when expansion terms are
sourced from collections that are a superset of the targeted collection
(Kwok and Chan, 1998).

However, our simple tf.idf summaries work well. Even one-word (S = 1)
summaries yield significantly improved average precision on TREC 8, for a memory
overhead of a few megabytes. The best cases were S = 40 on TREC 8 and S = 28
on TREC 10, where processing costs were only a third those of standard QE.
These gains are similar to those achieved by Lam-Adesina and Jones (2001)
with summaries of 6-9 sentences each, but our summaries are considerably more
compact, showing the advantage of a form of summary intended only for QE.
While the memory overheads are non-trivial - over 180 megabytes for TREC 10
- they are well within the capacity of a small desktop machine.
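
The ratios quoted above can be checked against the timings in Tables 1 and 2. The newswire figures are taken from the TREC 7 rows of Table 2, on the assumption that query time depends on the collection and the 10,000 timing queries rather than on the topic set:

$$\frac{t_{\mathrm{Standard}}}{t_{S=40}} = \frac{211}{61} \approx 3.5 \ \text{(newswire)}, \qquad \frac{t_{\mathrm{Standard}}}{t_{S=28}} = \frac{615}{204} \approx 3.0 \ \text{(WT10g)}$$

$$\frac{t_{\mathrm{Standard}}}{t_{\mathrm{None}}} = \frac{211}{23} \approx 9.2 \ \text{(newswire)}, \qquad \frac{t_{\mathrm{Standard}}}{t_{\mathrm{None}}} = \frac{615}{62} \approx 9.9 \ \text{(WT10g)}$$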

Results on TREC 7 for the summaries are equally satisfactory, with good
effectiveness and low overheads. Results on TREC 9 are, however, disappointing.
We had already discovered that expansion on TREC 9 does not improve
effectiveness (Billerbeck and Zobel, 2004); our results here are, in that light,
unsurprising. The principal observation is that QE based on summaries is still
of similar effectiveness to that based on full documents.

Fig. 1. Varying average precision and associated memory cost with the number and
cutoff value of summary terms respectively. Using the TREC 8 collection and queries.

Fig. 2. As in previous figure, but using the TREC 10 collection and queries.

We show only one value for the cutoff threshold, C = 1.0. This leads to
the same effectiveness for similar memory overhead. Summaries and choice of
S and C are further examined in Figures 1 and 2 for newswire and web data
respectively. These show that a wide range of S values (left figure) and C values
(right figure) lead to improved effectiveness, in some cases exceeding that of
standard QE.

6 Conclusions 

We have identified the main costs of query expansion and, for each stage of the
query evaluation process, considered options for reducing costs. Guided by
preliminary experiments, we explored two options in detail: expansion via
reduced-size collections and expansion via document surrogates. Two forms of
surrogates were considered: query associations, consisting of queries for which
each document was highly ranked, and tf.idf summaries.

Bodo Billerbeck and Justin Zobel

The most successful method was the tf.idf summaries. These are much 
smaller than the original collections, yet are able to provide effectiveness close 
to that of standard QE. The size reduction and simple representation mean
that they can be rapidly processed. Of the two methods for building summaries, 
slightly better performance was obtained with those consisting of terms whose 
selection value exceeded a global threshold. The key to the success of this method 
is that it eliminates several costs: there is no need to fetch documents after the 
initial phase of list processing, and selection and extraction of candidate terms 
is trivial. 

Many of the methods we explored were unsuccessful. Associations can yield 
good effectiveness if a log is available, but are expensive to process. Reduced-size 
collections yielded no benefits; it is possible that choosing documents on a more 
principled basis would lead to different effectiveness outcomes, but the costs 
are unlikely to be reduced. Streamlining list processing by carrying accumulator 
information from one stage to the next led to a collapse in effectiveness. Our 
tf.idf summaries, in contrast, maintain the effectiveness of QE while reducing 
time by a factor of three. 
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Abstract. The prediction of query performance is an interesting and 
important issue in Information Retrieval (IR). Current predictors in- 
volve the use of relevance scores, which are time-consuming to compute. 
Therefore, current predictors are not very suitable for practical applica- 
tions. In this paper, we study a set of predictors of query performance, 
which can be generated prior to the retrieval process. The linear and 
non-parametric correlations of the predictors with query performance 
are thoroughly assessed on the TREC disk4 and disk5 (minus CR) collections. 
According to the results, some of the proposed predictors have 
significant correlation with query performance, showing that these pre- 
dictors can be useful to infer query performance in practical applications. 



1 Introduction 

Robustness is an important measure reflecting the retrieval performance of an IR 
system. It particularly refers to how an IR system deals with poorly-performing 
queries. As stressed by Cronen-Townsend et al. [4], poorly-performing queries 
considerably hurt the effectiveness of an IR system. Indeed, this issue has become 
important in IR research. For example, in 2003, TREC proposed a new track, 
namely the Robust Track, which aims to investigate the retrieval performance 
of poorly-performing queries. Moreover, the use of reliable query performance 
predictors is a step towards determining, for each query, the most suitable 
retrieval strategy. For example, in [2], query performance predictors were used 
to devise a selective decision methodology that avoids the failure of query 
expansion. 

In order to predict the performance of a query, the first step is to differentiate 
the highly-performing queries from the poorly-performing queries. This problem 
has recently received increasing research attention. 

In [4], Cronen-Townsend et al. suggested that query performance is correlated 
with the clarity of a query. Following this idea, they used a clarity score 
as the predictor of query performance. In their work, the clarity score is 
defined as the Kullback-Leibler divergence of the query model from the collection 
model. In [2], Amati et al. proposed the notion of query-difficulty to predict 
query performance. Their basic idea is that the query expansion weight, which 
is the divergence of the query terms’ distribution in the top-retrieved documents 
from their distribution in the whole collection, provides evidence of the query 
performance. 

Both methods mentioned above select a feature of a query as the predic- 
tor, and estimate the correlation of the predictor with the query performance. 
However, it is difficult to incorporate these methods into practical applications 
because they are post-retrieval approaches, involving the time-consuming com- 
putation of relevance scores. 

In this paper, we study a set of predictors that can be computed before the 
retrieval process takes place. The retrieval process refers to the process where 
the IR system looks through the inverted files for the query terms and assigns 
a relevance score to each retrieved document. The experimental results show 
that some of the proposed predictors have significant correlation with query 
performance. Therefore, these predictors can be applied in practical applications. 

The remainder of this paper is organised as follows. Section 2 proposes a 
set of predictors of query performance. Sections 3 and 4 study the linear and 
non-parametric correlations of the predictors with average precision. Section 5 
presents a smoothing method for improving the most effective proposed predictor 
and the obtained results. Finally, Section 6 concludes this work and suggests 
further research directions. 

2 Predictors of Query Performance 

In this section, we propose a list of predictors of query performance. Similar 
to previous works mentioned in Section 1, we consider the intrinsic statistical 
features of queries as the predictors and use them in inferring the query per- 
formance. Moreover, these features should be computed prior to the retrieval 
process. The proposed list of predictors is inspired by previous works related 
to probabilistic IR models, including the language modelling approach [11] and 
Amati & van Rijsbergen’s Divergence From Randomness (DFR) models [3]: 

— Query length. According to Zhai & Lafferty’s work [15], in the language 
modelling approach, the query length has a strong effect on the smoothing 
methods. In our previous work, we also found that the query length heavily 
affects the length normalisation methods of the probabilistic models [7]. 

For example, the optimal setting for the so-called normalisation 2 in Amati & 
van Rijsbergen’s probabilistic framework is query-dependent [3]. The 
empirically obtained setting of its parameter c is c = 7 for short queries and 
c = 1 for long queries, suggesting that the optimal setting depends on the query 
length. Therefore, the query length could be an important characteristic of 
the queries. In this paper, we define the query length as: 

Definition 1 (ql): The query length is the number of non-stop words in the 
query. 

— The distribution of informative amount in query terms. In general, 
each term can be associated with an inverse document frequency (idf(t)) 
describing the informative amount that a term t carries. As stressed by 
Pirkola and Järvelin, the difference between the resolution power of the query 
terms, which is given as the idf(t) values, could affect the effectiveness of 
the retrieval performance [9]. Therefore, the distribution of the idf(t) factors 
in the composing query terms might be an intrinsic feature that affects the 
retrieval performance. In this paper, we investigate the following two possible 
definitions for the distribution of informative amount in query terms: 

Definition 2 ($\gamma_1$): Given a query Q, the distribution of informative 
amount in its composing terms, called $\gamma_1$, is represented as: 

\[ \gamma_1 = \sigma_{idf} \tag{1} \]

where $\sigma_{idf}$ is the standard deviation of the idf of the terms in Q. 
For idf, we use the INQUERY’s idf formula [1]: 

\[ idf(t) = \frac{\log_2\left((N + 0.5) / N_t\right)}{\log_2(N + 1)} \tag{2} \]

where $N_t$ is the number of documents in which the query term t appears and 
N is the number of documents in the whole collection. 

Another possible definition representing the distribution of informative 
amount in the query terms is: 

Definition 3 ($\gamma_2$): Given a query Q, the distribution of informative 
amount in its composing terms, called $\gamma_2$, is represented as: 

\[ \gamma_2 = \frac{idf_{max}}{idf_{min}} \tag{3} \]

where $idf_{max}$ and $idf_{min}$ are the maximum and minimum idf among the 
terms in Q respectively. 

The idf of Definition 3 is also given by the INQUERY’s idf formula. 

— Query clarity. Query clarity refers to the speciality/ambiguity of a query. 
According to the work by Cronen-Townsend et al. [4], the clarity (or on the 
contrary, the ambiguity) of a query is an intrinsic feature of a query, which 
has an important impact on the system performance. Cronen-Townsend et al. 
proposed the clarity score of a query to measure the coherence of the 
language usage in documents, whose models are likely to generate the query 
[4]. In their definition, the clarity of a query is the sum of the 
Kullback-Leibler divergence of the query model from the collection model. However, 
this definition involves the computation of relevance scores for the query 
model, which is time-consuming. In this paper, we simplify the clarity score 
by proposing the following definition: 

Definition 4 (SCS): The simplified query clarity score is given by: 

\[ SCS = \sum_{w \in Q} P_{ml}(w|Q) \cdot \log_2 \frac{P_{ml}(w|Q)}{P_{coll}(w)} \tag{4} \]

In the above definition, $P_{ml}(w|Q)$ is given by $qtf / ql$. It is the 
maximum likelihood of the query model of the term w in query Q. qtf is the 
number of occurrences of a query term in the query and ql is the query length. 
$P_{coll}(w)$ is the collection model, which is given by 
$tf_{coll} / token_{coll}$, where $tf_{coll}$ is the number of occurrences of 
a query term in the whole collection and $token_{coll}$ is the number of 
tokens in the whole collection. 

The above definition is simple, even naive, but it is very easy to compute. 
In Sections 3 and 4, we will show that this simplified 
definition has significant linear and non-parametric correlations with query 
performance. Moreover, in Section 5, the proposed simplified clarity score is 
improved by smoothing the query model. 

— Query scope. Similar to the clarity score, an alternative indication of the 
generality/speciality of a query is the size of the document set containing at 
least one of the query terms. As stressed in [10], the size of this document 
set is an important property of the query. Following [10], in this work, we 
define the query scope as follows: 

Definition 5 ($\omega$): The query scope is: 

\[ \omega = -\log(n_Q / N) \tag{5} \]

where $n_Q$ is the number of documents containing at least one of the query 
terms, and N is the number of documents in the whole collection. 
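
As a rough illustration of Definitions 1 to 5, the following Python sketch computes the five predictors from collection statistics alone. It is not the implementation used in the experiments: the function and argument names are hypothetical, the standard deviation is assumed to be the population standard deviation over the distinct query terms, and the logarithm of Equation (5) is assumed to be natural, since the definitions do not fix these choices. The count of documents containing at least one query term has to be supplied by the caller, for instance from the union of the posting lists.

```python
import math
from statistics import pstdev


def inquery_idf(n_docs, doc_freq):
    """INQUERY idf of Equation (2): log2((N + 0.5) / N_t) / log2(N + 1)."""
    return math.log2((n_docs + 0.5) / doc_freq) / math.log2(n_docs + 1)


def pre_retrieval_predictors(query_terms, doc_freq, coll_freq,
                             n_docs, n_tokens, n_q):
    """Return ql, gamma1, gamma2, SCS and omega for one query.

    query_terms: stopped and stemmed query terms (repeats allowed)
    doc_freq[t]: N_t, the number of documents containing term t
    coll_freq[t]: tf_coll, the occurrences of t in the whole collection
    n_q: number of documents containing at least one query term
    Every query term is assumed to occur in the collection.
    """
    ql = len(query_terms)                      # Definition 1
    distinct = sorted(set(query_terms))
    idfs = [inquery_idf(n_docs, doc_freq[t]) for t in distinct]
    gamma1 = pstdev(idfs)                      # Definition 2 (assumed population sd)
    gamma2 = max(idfs) / min(idfs)             # Definition 3
    scs = 0.0                                  # Definition 4
    for t in distinct:
        p_ml = query_terms.count(t) / ql       # qtf / ql
        p_coll = coll_freq[t] / n_tokens       # tf_coll / token_coll
        scs += p_ml * math.log2(p_ml / p_coll)
    omega = -math.log(n_q / n_docs)            # Definition 5 (assumed natural log)
    return {"ql": ql, "gamma1": gamma1, "gamma2": gamma2,
            "scs": scs, "omega": omega}


# Toy statistics, invented purely for illustration.
example = pre_retrieval_predictors(
    ["a", "b"],
    doc_freq={"a": 10, "b": 100},
    coll_freq={"a": 20, "b": 500},
    n_docs=1000, n_tokens=100000, n_q=105)
print(example)
```

None of these quantities requires a relevance score, which is the property that makes the predictors usable before retrieval.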

In the following sections, we will study the correlations of the predictors with 
query performance. In order to fully investigate the predictors, we check both 
the linear and the non-parametric dependence of the predictors on query performance. 
The latter is a commonly used measure for query performance predictors, 
since the distribution of the involved variables is usually unknown. In 
contrast, the linear dependence assumes a linear relationship between the involved 
variables. Although this strong assumption is not always true, a linear fit 
of the variables can be straightforwardly applied in practical applications. 

3 The Linear Dependence Between the Predictors 
and Average Precision 

In this section, we measure the linear correlation r of each predictor with the 
actual query performance, and the p-value associated with this correlation [5]. We 
use average precision (AP) as the measure representing the query performance 
in all our experiments. Again, note that the linear correlation assumes a 
linear relationship between the involved variables, which is not always true. 

The correlation r varies within [-1, 1]. It indicates the linear dependence 
between the two variables. A value of r = 0 indicates that the two variables 
are linearly uncorrelated, while r > 0 and r < 0 indicate that the correlation 
between the two variables is positive and negative, respectively. The p-value is 
the probability of randomly getting a correlation as large as the observed value, 
when the true correlation is zero. If the p-value is small, usually less than 
0.05, then the correlation is significant. A significant correlation of a 
predictor with AP indicates that this predictor could be useful to infer the 
query performance in practical applications. 
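
The two quantities described above can be sketched in a few lines of Python. The correlation r is computed from its textbook definition. For the p-value, the sketch swaps in a permutation test, which estimates the probability of a correlation at least as large as the observed one when the pairing of the two variables is random; this is an assumption made to keep the example free of external libraries and is not necessarily the test used for the reported results. The function names and the sample data are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Linear (Pearson) correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, trials=10000, seed=0):
    """Two-sided p-value: share of random pairings with |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothetical predictor values and AP scores for eight queries.
predictor = [0.9, 1.4, 2.1, 2.2, 3.0, 3.3, 4.1, 4.8]
ap = [0.05, 0.12, 0.10, 0.22, 0.25, 0.31, 0.30, 0.45]
r = pearson_r(predictor, ap)
p = permutation_p_value(predictor, ap)
print(f"r = {r:.4f}, p-value = {p:.4f}")
```

With 100 queries per query type, as in the experiments reported below, the same functions apply unchanged; only the two input lists grow.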



3.1 Test Data and Settings 

The document collection used to test the effectiveness of the proposed predictors 
consists of the TREC disk4&5 test collections (minus the Congressional Record on 
disk4). The test queries are the TREC topics 351-450, which are used in the 
TREC7&8 ad-hoc tasks. For all the documents and queries, the stop-words are 
removed using a standard list and Porter’s stemming algorithm is applied. 

Each query consists of three fields, i.e. Title, Description and Narrative. In 
our experiments, we define three types of queries with respect to the different 
combinations of these three fields: 

— Short query: Only the titles are used. 

— Normal query: Only the descriptions are used. 

— Long query: All the three fields are used. 

The statistics of the length of the three types of queries are provided in 
Table 1. We run experiments for the three types of queries to check the impact 
of the query type on the effectiveness of the predictors, including the query 
length. 

In the experiments of this section, given the AP value of each query, we 
compute r and the corresponding p-value of the linear dependence between the 
two variables, i.e. AP and each of the predictors. The AP values of the test 
queries are given by the PL2 and BM25 term-weighting models, respectively. We 
use two statistically different models in order to check if the effectiveness of the 
predictors is independent of the term-weighting model used. 

PL2 is one of the Divergence From Randomness (DFR) term weighting models 
developed within Amati & van Rijsbergen’s probabilistic framework for IR 
[3]. Using the PL2 model, the relevance score of a document d for query term t 
is given by: 

\[ w(t, d) = \left( tf \cdot \log_2 \frac{tf}{\lambda} + \left( \lambda + \frac{1}{12 \cdot tf} - tf \right) \cdot \log_2 e + 0.5 \cdot \log_2 (2\pi \cdot tf) \right) \cdot \frac{1}{tf + 1} \tag{6} \]

where $\lambda$ is the mean and variance of a Poisson distribution. 

The within document term frequency tf is then normalised using the 
normalisation 2: 

\[ tfn = tf \cdot \log_2 \left( 1 + c \cdot \frac{avg\_l}{l} \right), \quad (c > 0) \tag{7} \]

where l is the document length and avg_l is the average document length in 
the whole collection. 

Table 1. The statistics of the length of the three types of queries. avg_ql is the 
average query length. Var(ql) is the variance of the length of the queries 

            Short Query   Normal Query   Long Query 
avg_ql          2.42           7.55         21.13 
Var(ql)         0.42          10.19         55.77 

Table 2. The settings of the free parameters for different types of queries 

Parameter   Short Query   Normal Query   Long Query 
c of PL2        5.90           1.61          1.73 
b of BM25       0.09           0.25          0.64 


Replacing the raw term frequency tf by the normalised term frequency tfn in 
Equation (6), we obtain the final weight. c is a free parameter. It is automatically 
estimated by measuring the normalisation effect [7]. The first row of Table 2 
provides the applied c value for the three types of queries. 
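
A minimal sketch of how Equations (6) and (7) combine is given below, assuming that the Poisson parameter is supplied by the caller and that tf > 0. The function and argument names are hypothetical, and the code is an illustration of the formulas rather than the Terrier implementation used in the experiments.

```python
import math


def normalisation2(tf, doc_len, avg_doc_len, c):
    """Equation (7): tfn = tf * log2(1 + c * avg_l / l), with c > 0."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def pl2_weight(tf, doc_len, avg_doc_len, lam, c):
    """Equation (6) evaluated on the normalised frequency tfn (tf > 0)."""
    tfn = normalisation2(tf, doc_len, avg_doc_len, c)
    score = (tfn * math.log2(tfn / lam)
             + (lam + 1.0 / (12.0 * tfn) - tfn) * math.log2(math.e)
             + 0.5 * math.log2(2.0 * math.pi * tfn))
    return score / (tfn + 1.0)


# Illustrative call: c = 5.90 is the short-query setting from Table 2;
# the remaining numbers are invented.
print(pl2_weight(tf=3, doc_len=250, avg_doc_len=500, lam=0.02, c=5.90))
```

When the document has exactly the average length and c = 1, the normalisation leaves tf unchanged; longer documents receive a smaller tfn, which is the intended length penalty.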

As one of the most well-established IR systems, Okapi uses BM25 to measure 
the term weight, where the idf factor $w^{(1)}$ is normalised as follows [12]: 

\[ w(t, d) = w^{(1)} \cdot \frac{(k_1 + 1) \cdot tf}{K + tf} \cdot \frac{(k_3 + 1) \cdot qtf}{k_3 + qtf} \tag{8} \]

where w(t, d) is the final weight. K is given by 
$k_1 \left( (1 - b) + b \cdot \frac{l}{avg\_l} \right)$, where l and avg_l 
are the document length and the average document length in the collection, 
respectively. For the parameters $k_1$ and $k_3$, we use the standard setting of [14], 
i.e. $k_1 = 1.2$ and $k_3 = 1000$. qtf is the number of occurrences of a given term in 
the query and tf is the within document frequency of the given term. b is the 
free parameter of BM25’s term frequency normalisation component. Similar to 
the parameter c of the normalisation 2, it is estimated by the method provided 
in [7]. However, due to the “out of range” problem mentioned in [7], we applied 
a new formula for the normalisation effect (see Appendix). The second row of 
Table 2 provides the applied b values in all reported experiments. 
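
Equation (8) can be sketched as follows. The idf factor is passed in as a plain number because its formula is not given in this section; the function and argument names are hypothetical, and the defaults for k1 and k3 are the standard settings quoted above.

```python
def bm25_weight(tf, qtf, doc_len, avg_doc_len, w1, b, k1=1.2, k3=1000.0):
    """Equation (8) for one query term in one document.

    w1 is the idf factor w^(1), supplied by the caller.
    b is the term frequency normalisation parameter of BM25.
    """
    big_k = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    doc_part = (k1 + 1.0) * tf / (big_k + tf)
    query_part = (k3 + 1.0) * qtf / (k3 + qtf)
    return w1 * doc_part * query_part


# Illustrative call: b = 0.09 is the short-query setting from Table 2;
# the remaining numbers are invented.
print(bm25_weight(tf=3, qtf=1, doc_len=250, avg_doc_len=500, w1=4.2, b=0.09))
```

Setting b = 0 switches the length normalisation off, so K reduces to k1; larger values of b penalise documents that are longer than average more strongly.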



3.2 Discussion of Results 

In Table 3, we summarise the results of the linear correlations of the predictors 
with AP. From the results, we could derive the following observations: 

— Query length (see Definition 1) does not have a significant linear correlation 
with AP. This might be due to the fact that the lengths of queries of the same 
type are very similar (see Var(ql) in Table 1). To check this assumption, we 
computed the correlation of AP with the length of a mixture of three types 
of queries. Thus, we had 100 x 3 = 300 observations of both AP and query 
length. Measuring the correlation, we obtained r = 0.0585 and a p-value of 
0.3124, which again indicates a very low correlation. Therefore, query length 
seems to be very weakly correlated with AP. 

Table 3. The correlations r of the predictors with AP, and the related p-values. The 
results are given separately with respect to the three types of queries. Significant 
correlations (p-value < 0.05) are marked with *. The test queries are the topics 
used in TREC7&8 

Query type  Model  Statistic  ql        $\gamma_1$  $\gamma_2$  $\omega$   SCS 
Short       PL2    r          -0.1839   0.2398*     0.0569      0.3772*    0.4484* 
Short       PL2    p-value    0.0670    0.0163      0.5738      0.0001     3.037e-6 
Short       BM25   r          -0.1773   0.1860      0.0332      0.3746*    0.4208* 
Short       BM25   p-value    0.0776    0.0639      0.7430      0.0001     1.351e-5 
Normal      PL2    r          0.0830    0.3017*     0.1259      0.1895     0.2602* 
Normal      PL2    p-value    0.4116    0.0023      0.2120      0.0590     0.0089 
Normal      BM25   r          0.0876    0.2946*     0.1436      0.1629     0.2293* 
Normal      BM25   p-value    0.3862    0.0029      0.1542      0.1054     0.0217 
Long        PL2    r          0.0543    0.3227*     0.3029*     0.0910     0.2401* 
Long        PL2    p-value    0.5915    0.0011      0.0022      0.3679     0.0161 
Long        BM25   r          0.0790    0.2822*     0.2753*     0.0843     0.2066* 
Long        BM25   p-value    0.4349    0.0044      0.0056      0.4044     0.0392 


— $\gamma_1$ (see Definition 2) has significant linear correlation with AP in all 
cases except for the short queries when BM25 is used. It is also interesting to 
see that the correlations for normal and long queries are stronger than that for 
short queries. 

— The linear correlation of $\gamma_2$ (see Definition 3) with AP is only 
significant for long queries. Also, the correlation is positive, which indicates 
that a larger gap of informative amount between the query terms would result in a 
higher AP. Moreover, the results show that on the used test collection, 
$\gamma_1$ is more effective than $\gamma_2$ in inferring query performance. 

— For $\omega$, the query scope (see Definition 5), the linear correlation with 
AP is only significant for short queries. Perhaps this is because, as queries 
get longer, the query scope tends to be stable. Figure 1 supports this 
assumption. We can see that the $\omega$ values of normal and long queries are 
clearly more stable than those of short queries. 

— The simplified clarity score (SCS, see Definition 4) has significant linear 
correlation with AP in all circumstances. For the short queries, the use of 
PL2 results in the highest linear correlation among all the predictors (the 
linear fitting is given in Figure 2). However, when the query length increases, 
the correlation gets weaker. 

— Moreover, it seems that the predictors are generally less effective when BM25 
is used as the term-weighting model. For the same predictor, the AP given 
by BM25 is usually less correlated with it than the AP given by PL2. 

In summary, query type has a strong impact on the effectiveness of the 
predictors. Indeed, the correlation of a predictor with AP varies for diverse query 
types. For short queries, SCS and $\omega$ have strong linear correlations with 
AP. For normal queries, $\gamma_1$ has moderately significant linear correlation 
with AP. For long queries, $\gamma_1$ and $\gamma_2$ have significant linear 
correlations with AP. 

In general, among the five proposed predictors, SCS is the most effective one 
for short queries, and $\gamma_1$ is the most effective one for normal and long 
queries. For all the three types of queries, $\gamma_1$ is more effective than 
$\gamma_2$ in inferring query performance. Moreover, since $\omega$ was proposed 
for Web IR [10] and SCS is more effective than $\omega$, SCS could also be a good 
option for Web IR. Note that, although some previous works found that query 
length affects the retrieval performance [7, 15], it seems that query length is 
not significantly correlated with AP, at least on the used collection. 

Fig. 1. The ranked $\omega$ values in ascending order for the three types of queries 

Fig. 2. The linear correlation of SCS with AP using PL2 for short queries 

Finally, we found that, in most cases, the predictors are slightly less 
correlated with the AP obtained using BM25 than with that obtained using PL2. The 
difference of correlations is usually marginal, except for short queries, where 
$\gamma_1$ is significantly correlated with the AP obtained using PL2, but not 
BM25. Overall, the use of different term-weighting models does not considerably 
affect the correlations of the proposed predictors with AP. 

4 Non-parametric Correlation of the Predictors 
with Average Precision 

In this section, instead of the linear correlation, we check the non-parametric 
correlations of the predictors with AP. An appropriate measure for the 
non-parametric test is Spearman’s rank correlation [6]. In this paper, we denote 
Spearman’s correlation between variables X and Y as $r_s(X, Y)$. 
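
Spearman’s rank correlation can be sketched as the linear correlation of the ranks of the two variables, with tied values sharing their mean rank. The Python below is a self-contained illustration under that standard definition; the function names are hypothetical and the sketch does not compute a p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Linear (Pearson) correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rs(xs, ys):
    """Spearman's r_s: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotone but non-linear relation: the ranks agree perfectly,
# while the raw values are not perfectly linearly related.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(spearman_rs(xs, ys), pearson_r(xs, ys))
```

Because only the ordering of the values matters, no assumption about the form of the relationship between a predictor and AP is needed.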

The test data and experimental setting for checking Spearman’s correlation 
are the same as in the previous section. As shown in Table 4, the results are 
very similar to the linear correlations provided in Table 3. SCS is again the most 
effective predictor for short queries, and has significant Spearman’s 
correlations with AP for the three types of queries. Also, $\gamma_1$ seems to be 
the most effective predictor for normal and long queries. Moreover, the 
predictors are generally slightly less correlated with the AP obtained using BM25 
than with that obtained using PL2. Again, the difference of correlations is 
usually marginal, except for the correlation of $\gamma_1$ with AP for short 
queries, where $r_s(\gamma_1, AP)$ for PL2 is significant, while 
$r_s(\gamma_1, AP)$ for BM25 is not. Finally, $\gamma_1$ is still more effective 
than $\gamma_2$ as a query performance predictor. 

Table 4. The Spearman’s correlation $r_s$ of the predictors with AP for three types 
of queries using PL2 and BM25 respectively. Significant correlations 
(p-value < 0.05) are marked with *. The test queries are the topics used in TREC7&8 

Query type  Model  Statistic  ql        $\gamma_1$  $\gamma_2$  $\omega$   SCS 
Short       PL2    $r_s$      -0.0476   0.2141*     0.0279      0.3627*    0.4236* 
Short       PL2    p-value    0.6359    0.0331      0.7794      0.0003     2.504e-5 
Short       BM25   $r_s$      -0.0354   0.1449      -0.0217     0.3393*    0.3752* 
Short       BM25   p-value    0.7243    0.1497      0.8280      0.0007     0.0002 
Normal      PL2    $r_s$      -0.0646   0.3627*     0.1240      0.1790     0.2721* 
Normal      PL2    p-value    0.5203    0.0003      0.2183      0.0748     0.0068 
Normal      BM25   $r_s$      -0.0640   0.3439*     0.1129      0.1647     0.2583* 
Normal      BM25   p-value    0.5242    0.0006      0.2615      0.1013     0.0102 
Long        PL2    $r_s$      0.0132    0.3272*     0.2236*     0.1324     0.2668* 
Long        PL2    p-value    0.8958    0.0011      0.0266      0.1861     0.0079 
Long        BM25   $r_s$      -2.1e-05  0.2972*     0.1875      0.1544     0.2556* 
Long        BM25   p-value    0.9998    0.0030      0.0628      0.1238     0.0110 

We also compare $r_s(SCS, AP)$ with the $r_s(CS, AP)$ for the TREC7&8 and 
TREC4 ad-hoc tasks reported in [4]. CS stands for Cronen-Townsend et al.’s 
clarity score. To do the comparison, besides the $r_s(SCS, AP)$ for TREC7&8 
provided in Table 4, we also run experiments checking the $r_s(SCS, AP)$ values for 
the queries used in TREC4. The test queries for TREC4 are the TREC topics 
201-250, which are normal queries as they only consist of the descriptions. 
There was no experiment for long queries reported in [4]. The parameter c of 
the normalisation 2 (see Equation (7)) is also automatically set, to 1.64, in our 
experiments for TREC4. 

Regarding the generation of AP, Cronen-Townsend et al. apply Song & 
Croft’s multinomial language model for CS [13], and we apply PL2 for SCS. 
Since $r_s(SCS, AP)$ is stable for statistically diverse term-weighting models, i.e. 
PL2 and BM25 (see Table 4), we believe that the use of the two different 
term-weighting models will not considerably affect the comparison. 

Table 5 compares $r_s(SCS, AP)$ with the $r_s(CS, AP)$ reported in [4]. We can 
see that for normal queries, $r_s(CS, AP)$ is clearly higher than $r_s(SCS, AP)$. 
However, for short queries, although $r_s(CS, AP)$ is larger than $r_s(SCS, AP)$, 
the latter is still a significant, high correlation. 

In summary, SCS is effective in inferring the performance of short queries. 
Since the actual queries on the World Wide Web are usually very short, SCS 
can be useful for Web IR, or for other environments where queries are usually 
short. Moreover, SCS is very practical as the cost of its computation is 
insignificant. However, compared with CS, SCS seems to be moderately 
weak in inferring the performance of longer queries, including normal queries, 
although the $r_s(SCS, AP)$ values obtained for TREC7&8 (see Table 4) are still 
significant according to the corresponding p-values. 

Table 5. The Spearman’s correlations of clarity score (CS) and SCS with AP. For SCS 
and CS, AP is obtained using PL2 and Song & Croft’s multinomial language model, 
respectively. For TREC7&8, the queries are of short type. For TREC4, the queries are 
of normal type as they only consist of descriptions. The data in the first row are taken 
from [4] 

       TREC7&8 Short Query      TREC4 Normal Query 
       $r_s$     p-value        $r_s$     p-value 
CS     0.536     4.8e-8         0.490     3.0e-4 
SCS    0.424     2.5e-5         0.252     0.0779 


The moderately weak correlations of SCS with AP for longer queries might 
be due to the fact that the maximum likelihood of the query model 
($P_{ml}(w|Q)$) is not reliable when the query length increases. As mentioned 
before, the effectiveness of those predictors, which are positively correlated 
with the query length, decreases as the query gets longer. Therefore, we might be 
able to increase the correlation by smoothing the query model, which is directly 
related to the query length. We will discuss this issue in the next section. 

5 Smoothing the Query Model of SCS 

In this section, we present a method for smoothing the query model of SCS. For 
the estimation of the query model $P(w|Q)$, instead of introducing the document 
model by a total probability formula [4], we model the qtf density of query 
length ql directly, so that the computation of SCS does not involve the use of 
relevance scores. Note that qtf is the frequency of the term in the query Q. 

Let us start by assuming an increasing qtf density of query length ql. We then 
have the following density function: 

\[ \rho = C \cdot ql^{\beta} \tag{9} \]

where $\rho$ is the density and C is a constant of the density function. The 
exponent $\beta$ should be larger than 0. An appropriate value is $\beta = 0.5$. 

Taking the average query length as the interval of the integral of $\rho$, we 
then have the following smoothing function: 

\[ qtfn = \int_{ql}^{ql + avg\_ql} \rho \, d(ql) = \nu \cdot \left( (ql + avg\_ql)^{1.5} - ql^{1.5} \right) \tag{10} \]

where qtfn is the smoothed qtf. Replacing qtf with qtfn in Definition 4, we 
obtain the smoothed query model. avg_ql is the average query length. $\nu$ is 
a free parameter. It is empirically set in our experiments (see the third column 
of Table 6). 
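
The closed form in Equation (10) follows from integrating the density of Equation (9) with $\beta = 0.5$, which gives $\nu = C / 1.5$. The Python sketch below checks this numerically with a midpoint rule. The function names and the value chosen for C are hypothetical, and the sketch only illustrates the smoothing function itself, not the parameter tuning.

```python
def density(ql, c, beta=0.5):
    """Equation (9): rho = C * ql ** beta."""
    return c * ql ** beta


def smoothed_qtf(ql, avg_ql, nu):
    """Equation (10) with beta = 0.5: nu * ((ql + avg_ql)^1.5 - ql^1.5)."""
    return nu * ((ql + avg_ql) ** 1.5 - ql ** 1.5)


def integrate_density(ql, avg_ql, c, beta=0.5, steps=10000):
    """Midpoint-rule integral of the density over [ql, ql + avg_ql]."""
    width = avg_ql / steps
    total = sum(density(ql + (i + 0.5) * width, c, beta) for i in range(steps))
    return total * width


# C = 3.0 is an arbitrary illustrative constant; nu = C / (beta + 1) = C / 1.5.
c_const = 3.0
closed_form = smoothed_qtf(2.0, 2.42, c_const / 1.5)
numeric = integrate_density(2.0, 2.42, c_const)
print(closed_form, numeric)
```

Since C is an unknown constant, treating $\nu$ as a single free parameter, as the text does, avoids having to estimate C separately.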

Table 6 summarises the obtained $r_s(SCS, AP)$ values using the smoothing 
function. For short queries, no significant effect is noticed. However, for normal 
and long queries, the $r_s$ values are considerably larger than the values obtained 
without the use of the smoothing function (see Table 4). It is also encouraging 
to see that for TREC4, in contrast to the $r_s$ value in Table 5, the obtained 
$r_s$ value using the smoothing function is significant. Therefore, the 
effectiveness of SCS has improved for normal and long queries by smoothing the 
query model. 

Table 6. The Spearman’s correlation of SCS with AP for different types of queries 
using the smoothing function. AP is obtained using PL2 

Task      Query Type   $\nu$     $r_s$    p-value 
TREC7&8   Short        e-5       0.4268   2.471e-5 
TREC7&8   Normal       2.5e-4    0.3017   0.0027 
TREC7&8   Long         2.5e-4    0.3002   0.0028 
TREC4     Normal       5e-5      0.2847   0.0463 

6 Conclusions and Future Work 

We have studied a set of pre-retrieval predictors for query performance. The 
predictors can be generated before the retrieval process takes place, which is 
more practical than current approaches to query performance prediction. We 
have measured the linear and non-parametric correlations of the predictors with 
AP. According to the results, the query type has an important impact on the 
effectiveness of the predictors. Among the five proposed predictors, a simplified 
definition of clarity score (SCS) has the strongest correlation with AP for short 
queries. $\gamma_1$ is the most correlated with AP for normal and long queries. 
Also, we have shown that SCS can be improved by smoothing the query model. Taking 
the complexity of generating a predictor into consideration, SCS and $\gamma_1$ 
can be useful for practical applications. Moreover, according to the results, the 
use of two statistically diverse term-weighting models does not have an impact on 
the overall effectiveness of the proposed predictors. 

In the future, we will investigate improving the predictors using various 
methods. For example, we plan to develop a better smoothing function for 
the query model of SCS. We will also incorporate the proposed predictors into 
our query clustering mechanism, which has been applied to select the optimal 
term-weighting model, given a particular query [8]. The use of better 
predictors would hopefully allow the query clustering mechanism to be improved. As a 
consequence, the query-dependence problem of the term frequency normalisation 
parameter tuning, stressed in [7], could be overcome. 

Acknowledgments 

This work is funded by the Leverhulme Trust, grant number F/00179/S. The 
grant funds the development of the Smooth project, which investigates the 
term frequency normalisation (URL: http://ir.dcs.gla.ac.uk/smooth). The 
experimental part of this paper has been conducted using the Terrier framework 
(EPSRC, grant GR/R90543/01, URL: http://ir.dcs.gla.ac.uk/terrier). We would 
also like to thank Gianni Amati for his helpful comments on the paper. 




Ben He and Iadh Ounis 



References 

1. J. Allan, L. Ballesteros, J. Callan, W. Croft. Recent experiments with INQUERY. 
In Proceedings of TREC-4, pp. 49-63, Gaithersburg, MD, 1995. 

2. G. Amati, C. Carpineto, G. Romano. Query difficulty, robustness, and selective 
application of query expansion. In Proceedings of ECIR’04, pp. 127-137, 
Sunderland, UK, 2004. 

3. G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval 
based on measuring the divergence from randomness. In TOIS, 20(4), pp. 357-389, 
2002. 

4. S. Cronen-Townsend, Y. Zhou, W. B. Croft. Predicting query performance. In 
Proceedings of SIGIR’02, pp. 299-306, Tampere, Finland, 2002. 

5. M. DeGroot. Probability and Statistics. Addison Wesley, 2nd edition, 1989. 

6. J. D. Gibbons and S. Chakraborti. Nonparametric statistical inference. New York, 
M. Dekker, 1992. 

7. B. He and I. Ounis. A study of parameter tuning for term frequency normalization. 
In Proceedings of CIKM’03, pp. 10-16, New Orleans, LA, 2003. 

8. B. He and I. Ounis. A query-based pre-retrieval model selection approach to 
information retrieval. In Proceedings of RIAO’04, pp. 706-719, Avignon, France, 2004. 

9. A. Pirkola and K. Järvelin. Employing the resolution power of search keys. JASIST, 
52(7):575-583, 2001. 

10. V. Plachouras, I. Ounis, G. Amati, C. J. van Rijsbergen. University of Glasgow at 
the Web Track: Dynamic application of hyperlink analysis using the query scope. 
In Proceedings of TREC2003, pp. 248-254, Gaithersburg, MD, 2003. 

11. J. M. Ponte and W. B. Croft. A language modeling approach to information 
retrieval. In Proceedings of SIGIR’98, pp. 275-281, Melbourne, Australia, 1998. 

12. S. Robertson, S. Walker, M. M. Beaulieu, M. Gatford, A. Payne. Okapi at TREC-4. 
In Proceedings of TREC-4, pp. 73-96, Gaithersburg, MD, 1995. 

13. F. Song and W. Croft. A general language model for information retrieval. In 
Proceedings of SIGIR’99, pp. 279-280, Berkeley, CA, 1999. 

14. K. Sparck-Jones, S. Walker, S. Robertson. A probabilistic model of information 
retrieval: Development and comparative experiments. IPM, 36(2000):779-840, 2000. 

15. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied 
to ad hoc information retrieval. In Proceedings of SIGIR’01, pp. 334-342, New 
Orleans, LA, 2001. 



Appendix 



The new formula for the normalisation effect $NE_D$ is the following: 

$$NE_D = \mathrm{Var}\left(\frac{NE_{d_i}}{NE_{d,\max}}\right), \quad d_i \in D \qquad (11)$$

where D is the set of documents containing at least one of the query terms, and 
$d_i$ is a document in D. $NE_{d,\max}$ is the maximum $NE_{d_i}$ in D. Var denotes 
the variance. $NE_{d_i}$ is given by: 

$$NE_{d_i} = \frac{1}{(1 - b) + b \cdot \frac{l}{avg\_l}} \qquad (12)$$

where l is the length of the document $d_i$, b is a free parameter of BM25, and 
avg_l is the average document length in the whole collection. 
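
As a minimal sketch of how equations (11) and (12) might be evaluated, the following Python function takes the lengths of the documents in D and returns the normalisation effect. The function name and argument names are illustrative, and the use of the population variance is an assumption, since the text only says "variance".

```python
from statistics import pvariance


def normalisation_effect(doc_lengths, b, avg_l):
    """Return NE_D for the lengths of the documents in D (equations 11 and 12)."""
    # Equation (12): per-document length-normalisation factor.
    ne = [1.0 / ((1.0 - b) + b * (l / avg_l)) for l in doc_lengths]
    ne_max = max(ne)
    # Equation (11): variance of the factors after scaling by the maximum.
    # Population variance is an assumption; the text does not specify which.
    return pvariance([x / ne_max for x in ne])
```

When every document in D has the same length, all factors are equal and the function returns zero.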
Abstract. Documents are co-derivative if they share content: for two 
documents to be co-derived, some portion of one must be derived from 
the other or some portion of both must be derived from a third document. 
The current technique for concurrently detecting all co-derivatives in a 
collection is document fingerprinting, which matches documents based 
on the hash values of selected document subsequences, or chunks. Fin- 
gerprinting is currently hampered by an inability to accurately isolate 
information that is useful in identifying co-derivatives. In this paper we 
present SPEX, a novel hash-based algorithm for extracting duplicated 
chunks from a document collection. We discuss how information about 
shared chunks can be used for efficiently and reliably identifying co- 
derivative clusters, and describe DECO, a prototype system that makes 
use of SPEX. Our experiments with several document collections demon- 
strate the effectiveness of the approach. 



1 Introduction 

Many document collections contain sets of documents that are co-derived. Exam- 
ples of co-derived documents include plagiarised documents, document revisions, 
and digests or abstracts. Knowledge of co-derivative document relationships in a 
collection can be used for returning more informative results from search engines, 
detecting plagiarism, and managing document versioning in an enterprise. 

Depending on the application, we may wish to identify all pairs of co-derived 
documents in a given collection (the n x n or discovery problem) or only those 
documents that are co-derived with a specified query document (the 1 x n or 
search problem). We focus in this research on the more difficult discovery prob- 
lem. While it is possible to naively solve the discovery problem by repeated 
application of an algorithm designed for solving the search problem, this quickly 
becomes far too time-consuming for practical use. 

All current feasible techniques for solving the discovery problem are based on 
document fingerprinting, in which a compact representation of a selected sub- 
set of contiguous text chunks occurring in each document - its fingerprint - is 
stored. Pairs of documents are identified as possibly co-derived if enough of the 
chunks in their respective fingerprints match. Fingerprinting schemes differenti- 
ate themselves largely on the way in which chunks to be stored are selected. 
In this paper we introduce SPEX, a novel and efficient algorithm for identifying 
those chunks that occur more than once within a collection. We present 
the DECO package, which uses the shared phrase indexes generated by SPEX as 
the basis for accurate and efficient identification of co-derivative documents in a 
collection. We believe that DECO effectively addresses some of the deficiencies of 
existing approaches to this problem. Using several collections, we experimentally 
demonstrate that DECO is able to reliably and accurately identify co-derivative 
documents within a collection while using fewer resources than previous 
techniques of similar capability. We also have data to suggest that DECO should 
scale well to very large collections. 

2 Co-derivatives and the Discovery Problem 

We consider two documents to be co-derived if some portion of one document 
is derived from the other, or some portion that is present in both documents 
is derived from a third document. Broder (1997) defines two measures of co- 
derivation - resemblance and containment - in terms of the number of shingles 
(we shall use the term chunks) a pair of documents have in common. A chunk 
is defined by Broder as ‘a contiguous subsequence’; that is, each chunk repre- 
sents a contiguous set of words or characters within the document. An example 
chunk of length six taken from this document would be ‘each chunk represents 
a contiguous set’. The intuition is that, if a pair of documents share a number 
of such chunks, then they are unlikely to have been created independently. Such 
an intuition is what drives fingerprinting-based approaches, described later. 

We can conceptualise the co-derivation relationships within a collection as 
a graph, with each node representing a single document and the presence or 
absence of an edge between two nodes representing the presence or absence of 
a co-derivation relationship between the documents represented by those nodes. 
We call this the relationship graph of the collection. The task of the discovery 
problem is to discover the structure of this graph. Note that, as the number of 
edges in a graph is quadratic in the number of nodes, the task of discovering the 
structure of the relationship graph is a formidable one: for example, a collection 
of 100,000 documents contains nearly 5 billion unique document pairings. 
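
The figure quoted above follows from the number of unordered pairs among n documents, n(n - 1)/2. A short Python check (the function name is illustrative):

```python
def document_pairs(n):
    """Number of unordered document pairs among n documents: n(n - 1) / 2."""
    return n * (n - 1) // 2


print(document_pairs(100_000))  # 4999950000
```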

3 Strategies for Co-derivative Discovery 

There are several approaches to solving the search problem, in particular fin- 
gerprinting systems and ranking-based systems. Ranking-based systems such as 
relative frequency matching (Shivakumar & Garcia-Molina 1995) and the iden- 
tity measure (Hoad & Zobel 2003) make use of document statistics such as the 
relative frequency of words between documents to give a score for how likely a 
pair of documents is to be co-derived. In comparisons between such methods and 
fingerprinting, the ranking-based methods tended to perform better, though it is 
worth noting that the comparisons were carried out by the proponents of these 
systems. However, the only computationally feasible algorithms for the discovery 
problem to date have used the process of document fingerprinting. 
3.1 Fingerprinting 

The key observation underlying document fingerprinting (Manber 1994, Brin et 
al. 1995, Heintze 1996, Broder et al. 1997, Hoad & Zobel 2003) mirrors that be- 
hind the definitions of Broder (1997): if documents are broken down into small 
contiguous chunks, then co-derivative documents are likely to have a large num- 
ber of these chunks in common, whereas independently derived documents with 
overwhelming probability will not. Fingerprinting algorithms store a selection of 
chunks from each document in a compact form and flag documents as potentially 
co-derived if they have some common chunks in their fingerprints. 

While fingerprinting algorithms vary in many details, their basic process 
is as follows: documents in a collection are parsed into units (typically either 
characters or individual words); representative chunks of contiguous units are 
selected through the use of a heuristic; the selected chunks are then hashed for 
efficient retrieval and/or compact storage; the hash-keys, and possibly also the 
chunks themselves, are then stored, often in an inverted index structure (Witten 
et al. 1999). The index of hash-keys contains all the fingerprints for a document 
collection and can be used for the detection of co-derivatives. 
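
The basic process described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of any of the cited systems: the units are lower-cased whitespace-separated words, the hash is the first four bytes of MD5, and the names `chunks`, `hash_key`, `build_index` and `select` are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunks(text, size):
    """Yield every contiguous chunk of `size` words from `text`."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def hash_key(chunk):
    """Map a chunk to a 32-bit hash-key (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(chunk.encode("utf-8")).digest()[:4], "big")


def build_index(docs, size, select=lambda key: True):
    """Build an inverted index: hash-key -> set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            key = hash_key(chunk)
            if select(key):  # the selection heuristic; default is full selection
                index[key].add(doc_id)
    return index


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "an entirely unrelated sentence about something else",
}
index = build_index(docs, size=3)
# Postings lists with more than one document flag possible co-derivation.
shared = {key: ids for key, ids in index.items() if len(ids) > 1}
```

Here documents "a" and "b" share the chunks drawn from "quick brown fox jumps over", so their hash-keys appear in `shared`, while document "c" contributes only unique chunks.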

The principal way in which document fingerprinting algorithms differentiate 
themselves is in the choice of selection heuristic, that is, the method of determin- 
ing which chunks should be selected for storage in each document’s fingerprint. 
The range of such heuristics is diverse, as reviewed by Hoad & Zobel (2003). 
The simplest strategies are full selection, in which every chunk is selected, and 
random selection, where a given proportion or number of chunks is selected at 
random from each document to act as a fingerprint. Other strategies pick every 
nth chunk, or only pick chunks that are rare across the collection (Heintze 1996). 
Taking a different approach is the anchor strategy (Manber 1994), in which 
chunks are only selected if they begin with certain pre-specified combinations 
of letters. Simpler but arguably as effective is the modulo heuristic, in which 
a chunk is only selected if its hash-key modulo a parameter k is equal to zero. 
The winnowing algorithm of Schleimer et al. (2003) passes a window over the 
collection and selects the chunk with the lowest hash-key in each window. Both 
the anchor and modulo heuristics ensure a level of synchronisation between fin- 
gerprints in different documents, in that if a particular chunk is selected in one 
document, it will be selected in all documents. 
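
Two of the heuristics above are easy to state in code. The sketch below operates on a document's sequence of chunk hash-keys; the function names are illustrative, and selecting the rightmost minimum when a window contains ties is an assumption made here for definiteness.

```python
def modulo_select(keys, k):
    """Modulo heuristic: keep only hash-keys that are zero modulo k."""
    return [key for key in keys if key % k == 0]


def winnow_select(keys, w):
    """Winnowing: keep the lowest hash-key in each window of w consecutive keys."""
    selected = []
    last = None
    for i in range(len(keys) - w + 1):
        window = keys[i:i + w]
        lowest = min(window)
        # Break ties by taking the rightmost minimum (an assumption of this sketch).
        offset = max(j for j, key in enumerate(window) if key == lowest)
        pos = i + offset
        if pos != last:  # record each selected position only once
            selected.append(keys[pos])
            last = pos
    return selected


keys = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow_select(keys, 4))  # [17, 17, 8, 39, 17]
```

Because the modulo test depends only on the hash-key itself, a chunk selected in one document is selected in every document that contains it, which is the synchronisation property noted above.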

In their comparative experiments, Hoad & Zobel (2003) found that few of the 
fingerprinting strategies tested could reliably identify co-derivative documents 
in a collection. Of those that could, Manber’s anchor heuristic was the most 
effective, but its performance was inferior to their ranking-based identity measure 
system. Similarly, Shivakumar & Garcia-Molina (1995) found that the COPS 
fingerprinting system (Brin et al. 1995) was far more likely than their SCAM 
ranking-based system to fail to identify co-derivative documents. 

Several techniques use fingerprinting for the discovery problem: 

Manber (1994) counts the number of identical postings lists in the chunk 
index, arguing this can be used to identify clusters of co-derived documents in 
the collection. However, as Manber points out, there are many cases in which 
the results produced by his method can be extremely difficult to interpret. 
Broder et al. (1997) describe an approach in which each postings list is broken 
down to a set of document-pair tokens, one for each possible pairing in the 
list. The number of tokens for each pair of documents is counted and used as 
the basis for a set of discovery results. While this approach can yield far more 
informative results, taking the Cartesian product of each postings list means 
that the number of tokens generated is quadratic in the length of the postings 
list; this can easily cause resource blowouts and introduces serious scalability 
problems for the algorithm. 
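
The pair-token approach can be illustrated with a small Python sketch. It is not the authors' code; the function name and the representation of postings lists as lists of document identifiers are assumptions.

```python
from collections import Counter
from itertools import combinations


def count_shared_chunks(postings_lists):
    """Count, for every document pair, the postings lists in which both appear."""
    pair_counts = Counter()
    for doc_ids in postings_lists:
        # One token per unordered pair: a list of m documents yields
        # m(m - 1)/2 tokens, which is the quadratic cost discussed above.
        for pair in combinations(sorted(set(doc_ids)), 2):
            pair_counts[pair] += 1
    return pair_counts


postings = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["d"]]
counts = count_shared_chunks(postings)
```

For this toy input the pairs (a, b) and (b, c) each receive two tokens and (a, c) receives one. A single postings list of 1,000 documents would on its own generate 499,500 tokens.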

Shivakumar & Garcia-Molina (1999) addressed the scalability problems of the 
previous algorithm by introducing a hash-based probabilistic counting technique. 
For each document pair, instead of storing a token, a counter in a hashtable is 
incremented. A second pass generates a list of candidate pairs by discarding 
any pair that hashes to a counter that recorded insufficient hits. Assuming the 
hashtable is of sufficient size, this pruning significantly reduces the number of 
tokens that must be generated for the exact counting phase. 
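
A rough sketch of this two-pass idea follows, again in Python. The table size, the use of CRC32 as the hash and all names are illustrative assumptions, not details taken from the cited work.

```python
import zlib
from collections import Counter
from itertools import combinations


def candidate_pairs(postings_lists, table_size, threshold):
    """Two-pass counting: coarse hashed counters first, exact counts for survivors."""

    def slot(pair):
        return zlib.crc32("|".join(pair).encode("utf-8")) % table_size

    def pairs(doc_ids):
        return combinations(sorted(set(doc_ids)), 2)

    # Pass 1: increment one counter per pair occurrence. Collisions are not
    # resolved, so a counter can only over-estimate the count of a pair.
    counters = [0] * table_size
    for doc_ids in postings_lists:
        for pair in pairs(doc_ids):
            counters[slot(pair)] += 1

    # Pass 2: count exactly, but only for pairs whose counter reached the threshold.
    exact = Counter()
    for doc_ids in postings_lists:
        for pair in pairs(doc_ids):
            if counters[slot(pair)] >= threshold:
                exact[pair] += 1
    return {pair: n for pair, n in exact.items() if n >= threshold}


postings = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["d"]]
result = candidate_pairs(postings, table_size=1024, threshold=2)
```

Since the first pass never under-counts, no pair that meets the threshold is lost; a collision can only let an extra pair through to the exact counting phase, where it is then discarded.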

A fundamental weakness of fingerprinting strategies is that they cannot iden- 
tify and discard chunks that do not contribute towards the identification of any 
co-derivative pairs. Unique chunks form the vast majority in most collections, 
yet do not contribute toward solving the discovery problem. We analysed the 
LATimes newswire collection (see section 6) and found that out of a total of 
67,808,917 chunks of length eight, only 2,816,822 were in fact instances of du- 
plicate chunks: less than 4.5% of the overall collection. The number of distinct 
duplicated chunks is 907,981, or less than 1.5% of the collection total. 

The inability to discard unused data makes full fingerprinting too expensive 
for most practical purposes. Thus, it becomes necessary to use chunk-selection 
heuristics to keep storage requirements at a reasonable level. However, this in- 
troduces lossiness to the algorithm: current selection heuristics are unable to 
discriminate between chunks that suggest co-derivation between documents in 
the collection and those that do not. There is a significant possibility that two 
documents sharing a large portion of text are passed over entirely. 

For example, Manber (1994) uses character-level granularity and the modulo 
selection heuristic with k = 256. Thus, any chunk has an unbiased one-in-256 
chance of being stored. Consider a pair of documents that share an identical 1 
KB (1024 byte) portion of text. On average, four of the chunks shared by these 
documents will be selected. Using the Poisson distribution with $\lambda = 4$, we can 
estimate the likelihood that C chunks are selected as 
$P(C = 0) = e^{-4} \cdot 4^0/0! = 1.8\%$ and $P(C = 1) = e^{-4} \cdot 4^1/1! = 7.3\%$. 
This means that a pair of documents containing a full kilobyte of identical text 
have nearly a 2% chance of not having a single hash-key in common in their 
fingerprints, and a greater than 7% chance of only one hash-key in common. The 
same results obtain for an identical 100-word sequence with a word-level 
chunking technique and k = 25, as used by Broder et al. (1997). Such lossiness is 
unacceptable in many applications. 
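
The two probabilities can be reproduced with a short Python calculation. The helper name `poisson_pmf` is illustrative; the rate of 4 is the expected number of selected shared chunks, 1024/256.

```python
from math import exp, factorial


def poisson_pmf(count, lam):
    """Probability of exactly `count` events under a Poisson distribution."""
    return exp(-lam) * lam ** count / factorial(count)


lam = 1024 / 256  # expected number of shared chunks that are selected
p0 = poisson_pmf(0, lam)
p1 = poisson_pmf(1, lam)
print(f"{p0:.1%} {p1:.1%}")  # 1.8% 7.3%
```

The word-level case gives the same rate, 100/25 = 4, and therefore the same probabilities.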

Schleimer et al. (2003) make the observation that the modulo heuristic provides 
no guarantee of storing a shared chunk no matter how long the match. 
Whatever the match length, there is a nonzero probability that it will be 
overlooked. Their winnowing selection heuristic is able to guarantee that any 
contiguous run of shared text greater than a user-specifiable size w will register 
at least one identical hash-key in the fingerprints of the documents in question. 
However, a document that contains fragmented duplication below the level of w 
can still escape detection by this scheme: it is still fundamentally a lossy 
algorithm. 

3.2 Algorithms for Lossless Fingerprinting 

We make the observation that as only chunks that occur in more than one 
document contribute towards identifying co-derivation, a selection strategy that 
selected all such chunks would provide functional equivalence to full fingerprint- 
ing, but at a fraction of the storage cost for most collections. The challenge is 
to find a way of efficiently and scalably discriminating between duplicate and 
unique chunks. 

Hierarchical dictionary-based compression techniques like SEQUITUR 
(Nevill-Manning & Witten 1997) and RE-PAIR (Larsson & Moffat 2000) are primarily 
designed to eliminate redundancy by replacing strings that occur more than 
once in the data with a reference to an entry in a ruleset. Thus, passages of 
text that occur multiple times in the collection are identified as part of the 
compression process. This has been used as the basis for phrase-based collection 
browsing tools such as PHIND (Nevill-Manning et al. 1997) and RE-STORE (Moffat 
& Wan 2001). However, the use of these techniques in most situations is ruled 
out by their high memory requirements: the PHIND technique needs about twice 
the memory of the total size of the collection being browsed (Nevill-Manning et 
al. 1997). To keep memory use at reasonable levels, the input data is generally 
segmented and compressed block-by-block; however, this negates the ability of 
the algorithm to identify globally duplicated passages. Thus, such algorithms 
are not useful for collections of significant size. 

Suffix trees are another potential technique for duplicate-chunk identification, 
and are used in this way in computational biology (Gusfield 1997). However, the 
suffix tree is an in-memory data structure that consumes a quantity of memory 
equal to several times the size of the entire collection. Thus, this technique is 
also only suitable for small collections. 

4 The SPEX Algorithm 

Our novel hash-based SPEX algorithm for duplicate-chunk extraction has much 
more modest and flexible memory requirements than the above and is thus the 
first selection algorithm that is able to provide lossless chunk selection within 
large collections. The fundamental observation behind the operation of SPEX is 
that if any subchunk of a given chunk can be shown to be unique, then the 
chunk in its entirety must be unique. For example, if the chunk ‘quick brown’ 
occurs only once in the collection, there is no possibility that the chunk ‘quick 
brown fox’ is repeated. SPEX uses an iterated hashing approach to discard unique 
chunks and leave only those that are very likely to be duplicates. 

Algorithm 1 The SPEX algorithm. 

    // C: Collection of chunks 
    // l: Target chunk length 
    // c_n: Chunk of length n 
    // c_n{p ... q}: The chunk composed of words p through q of chunk c_n 
    // #(c): The hash value of chunk c 
    // h_n: Hashcounter for chunks of length n 

    for all c_1 ∈ C do 
        h_1[#(c_1)] ← h_1[#(c_1)] + 1 
    end for 
    for n ∈ [2, l] do 
        for all c_n ∈ C do 
            if h_{n-1}[#(c_n{1 ... n-1})] > 1 and h_{n-1}[#(c_n{2 ... n})] > 1 then 
                h_n[#(c_n)] ← h_n[#(c_n)] + 1 
            end if 
        end for 
    end for 

The basic mechanics of the algorithm are shown in Algorithm 1. At the core 
of SPEX is a pair of hashcounters - hashtable accumulator arrays - designed to 
count string occurrences. Each time a string is inserted into a hashcounter, it is 
hashed and a counter at that location is incremented. Collisions are not resolved. 
For the purposes of the SPEX algorithm, we care about only three counter values: 
0, 1 and ‘2 or more’. As such, each field in the hashcounter need be only two 
bits wide. If the same string is inserted into a hashcounter more than once, the 
hashcounter will indicate this. The hashcounter can also return false positives, 
indicating a string occurs multiple times when it in fact does not. A small number 
of such false positives can be tolerated by SPEX; the number can be kept small 
because the two-bit wide fields allow for extremely large hashcounters to reside 
in a relatively modest amount of memory. 
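
A hashcounter of this kind might be sketched as follows. The class name, the use of CRC32 as the hash function and the packing of four two-bit fields per byte are assumptions made for illustration; the text fixes only the two-bit width and the absence of collision resolution.

```python
import zlib


class HashCounter:
    """Array of two-bit saturating counters; collisions are not resolved."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.bits = bytearray((num_slots + 3) // 4)  # four 2-bit fields per byte

    def _slot(self, string):
        return zlib.crc32(string.encode("utf-8")) % self.num_slots

    def _get(self, slot):
        return (self.bits[slot // 4] >> (2 * (slot % 4))) & 0b11

    def insert(self, string):
        slot = self._slot(string)
        value = self._get(slot)
        if value < 2:  # saturate at '2 or more'
            shift = 2 * (slot % 4)
            cleared = self.bits[slot // 4] & ~(0b11 << shift) & 0xFF
            self.bits[slot // 4] = cleared | ((value + 1) << shift)

    def count(self, string):
        """Return 0, 1 or 2, where 2 means 'two or more' (possibly a false positive)."""
        return self._get(self._slot(string))


hc = HashCounter(1 << 16)
hc.insert("quick")
first = hc.count("quick")   # 1 after a single insertion
hc.insert("quick")
hc.insert("quick")
later = hc.count("quick")   # saturates at 2
```

At two bits per field, a hashcounter with 2^30 slots would occupy 256 MB, which indicates how large the table can be made relative to available memory.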

When a document collection is presented to SPEX, the first step is to sequen- 
tially scan the collection and insert each word encountered into a hashcounter. 
This hashcounter thus indicates (with the possibility of false positives) whether 
a word occurs multiple times in the collection. Following this, we pass a sliding 
window of size two words over the collection. Each two-word chunk is broken 
down into two single word subchunks and compared against the hashcounter. 
If the hashcounter indicates that both subchunks occur multiple times then the 
chunk is inserted into the second hashcounter. Otherwise, the chunk is rejected. 
After this process is complete, the second hashcounter indicates whether a par- 
ticular chunk of size two is a possible duplicate chunk. For chunks of size three, 
we pass a sliding window of length three over the collection and decompose the 
candidate chunks into two subchunks of length two. We similarly accept a chunk 
only if it is indicated by the hashcounter that both subchunks occur multiple 
times within the collection. Figure 1 illustrates this process. 

Fig. 1. The process for inserting a new chunk into the hashcounter in SPEX. The chunk 
“the quick brown” is divided into two sub-chunks “the quick” and “quick brown”. They 
are each hashed into the old hashcounter. If the count for both sub-chunks is greater 
than one, the full chunk is hashed and the counter at that location in the new 
hashcounter is incremented. 

The algorithm can be extended to any desired chunk size l by iteration, 
at each phase incrementing the chunk size by one. We only ever require two 
hashcounters because the hashcounter for chunks of size n - 2 is no longer 
required when searching for chunks of size n and may be reused. We are not overly 
concerned about false positives, because subsequent iterations tend to have a 
dampening rather than an amplifying effect on their presence. SPEX is thus able 
to provide an accurate representation of duplicate chunks of length u in a time 
proportional to O(uv), where v is the length of the document collection. 
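
The iteration described in this section can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it keeps one byte per counter instead of two bits, uses CRC32 as the hash, holds the collection in memory as lists of words, and all names are hypothetical.

```python
import zlib


def spex(documents, target_len, num_slots=1 << 20):
    """Return the chunks of `target_len` words that are accepted as probable
    duplicates. `documents` is a list of word lists."""

    def slot(words):
        return zlib.crc32(" ".join(words).encode("utf-8")) % num_slots

    def bump(counter, words):
        s = slot(words)
        counter[s] = min(counter[s] + 1, 2)  # only 0, 1 and '2 or more' matter

    # Pass 1: count single words.
    prev = bytearray(num_slots)
    for doc in documents:
        for word in doc:
            bump(prev, [word])

    # Passes 2..target_len: count a chunk only if both of its (n - 1)-word
    # subchunks were seen more than once in the previous pass.
    for n in range(2, target_len + 1):
        cur = bytearray(num_slots)
        for doc in documents:
            for i in range(len(doc) - n + 1):
                chunk = doc[i:i + n]
                if prev[slot(chunk[:-1])] > 1 and prev[slot(chunk[1:])] > 1:
                    bump(cur, chunk)
        prev = cur  # the older hashcounter is no longer needed

    # Final scan: report chunks whose counter indicates two or more occurrences.
    return {
        tuple(doc[i:i + target_len])
        for doc in documents
        for i in range(len(doc) - target_len + 1)
        if prev[slot(doc[i:i + target_len])] > 1
    }


docs = [
    "the quick brown fox jumps".split(),
    "a quick brown fox sleeps".split(),
]
duplicates = spex(docs, 3)
```

A chunk that really does occur twice, such as "quick brown fox" here, is always reported; hash collisions can only add false positives. Each of the u passes scans the whole collection once, which matches the O(uv) running time stated above.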

5 The DECO Package 

Our DECO system for co-derivative detection presents a number of innovations. 
The most significant of these is the use of SPEX for creating shared-chunk 
indexes. Another addition is the inclusion of more sophisticated scoring functions 
for determining whether documents are co-derived. DECO operates in two phases: 
index building and relationship graph generation. In the index building phase, 
SPEX is used as described earlier. At the final iteration of the algorithm, the 
chunks that are identified as occurring more than once are stored in an inverted 
index structure (Witten et al. 1999). This index contains an entry for each du- 
plicate chunk and a list of each document where it occurs. We call this index the 
shared-chunk index. 

In the relationship graph generation phase, DECO uses the shared-chunk index 
and an approximate counting technique similar to that proposed by Shivakumar 
& Garcia-Molina (1999) in order to identify co-derived document pairs. Several 
parameters must be specified to guide this process: the most important of these 
are the scoring function and the inclusion threshold. Given documents u and v, 
the scoring function may at present be one of the following: 

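$S_1(u,v) = \sum_{c \in u \wedge c \in v} 1$ 
$S_2(u,v) = \sum_{c \in u \wedge c \in v} \frac{1}{\min(\bar{u}, \bar{v})}$ 

$S_3(u,v) = \sum_{c \in u \wedge c \in v} \frac{1}{\mathrm{mean}(\bar{u}, \bar{v})}$ 
$S_4(u,v) = \sum_{c \in u \wedge c \in v} \frac{1}{f_c \cdot \mathrm{mean}(\bar{u}, \bar{v})}$ 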

where $\bar{u}$ is the length (in words) of a document u, and $f_c$ is the number of 
collection documents a given chunk c appears in. Function S1 above simply counts 
the number of chunks common to the two documents; this elementary scoring 
method is how fingerprinting algorithms have worked up to now. Functions S2 
and S3 attempt to normalise the score relative to the size of the documents, 
so that larger documents don’t dominate smaller ones in the results. They are 
very similar to the resemblance measure of Broder (1997) but are modified for 
more efficient computation. Function S4 gives greater weight to phrases that 
are rare across the collection. These scoring functions are all simple heuristics; 
further refinement of these functions and the possible use of statistical models 
is desirable and a topic of future research. 
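
To make the four scores concrete, the sketch below computes them for a single document pair. The exact functional forms are assumptions based on the description in the text: S2 divides the shared-chunk count by the shorter document length, S3 by the mean length, and S4 additionally divides each chunk's contribution by its document frequency. All names are illustrative.

```python
def score_pair(shared_chunks, len_u, len_v, doc_freq):
    """Return (S1, S2, S3, S4) for one document pair.

    shared_chunks: chunks present in both documents
    len_u, len_v:  document lengths in words
    doc_freq:      mapping chunk -> number of documents containing it (f_c)
    """
    shortest = min(len_u, len_v)
    mean = (len_u + len_v) / 2
    s1 = len(shared_chunks)            # plain count of shared chunks
    s2 = s1 / shortest                 # assumed form: normalise by the shorter length
    s3 = s1 / mean                     # assumed form: normalise by the mean length
    # Assumed form: rarer chunks (small f_c) contribute more.
    s4 = sum(1 / (doc_freq[c] * mean) for c in shared_chunks)
    return s1, s2, s3, s4


s1, s2, s3, s4 = score_pair({"x", "y"}, 10, 30, {"x": 2, "y": 4})
```

With two shared chunks and document lengths of 10 and 30 words, this gives S1 = 2, S2 = 0.2, S3 = 0.1 and S4 = 1/40 + 1/80 = 0.0375.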

The inclusion threshold is the minimum value of S{u,v) for which an edge 
between u and v will be included in the relationship graph. We wish to set the 
threshold to be such that pairs of co-derived documents score above the threshold 
while pairs that are not co-derived score below the threshold. 

6 Experimental Methodology 

We use three document collections in our experiments. The webdata+xml and 
linuxdocs collections were accumulated by Hoad & Zobel (2003). The webdata+xml 
collection consists of 3,307 web documents totalling approximately 35 megabytes, 
into which have been seeded nine documents (the XML documents), each of 
which is a substantial edit by a different author of a single original report 
discussing XML technology. Each of these nine documents shares a co-derivation 
relationship with each of the other eight documents, though in some cases they 
only have a relatively small quantity of text in common. The linuxdocs collection 
consists of 78,577 documents (720 MB) drawn from the documentation included 
with a number of distributions of RedHat Linux. While the webdata+xml 
collection serves as an artificial but easily-analysed testbed for co-derivative 
identification algorithms, the linuxdocs collection, rich in duplicate and 
near-duplicate documents, is a larger and more challenging real-world collection. 

The LATimes collection is a 476 megabyte collection of newswire articles 
from the Los Angeles Times, one of the newswire collections created for the 
TREC conference (Harman 1995). This collection is used as an example of a 
typical document collection and is used to investigate the index growth we may 
expect from such a typical collection. 

We define a collection’s reference graph as the relationship graph that would 
be generated by a human judge for the collection¹. The coverage of a given 
computer-generated relationship graph is the proportion of edges in the 
reference graph that are also contained in that graph, and the density of a 
relationship graph is the proportion of edges in that graph that also appear in the 
reference graph. While these two concepts are in many ways analogous to the 
traditional recall and precision metrics used in query-based information retrieval 
(Baeza-Yates & Ribeiro-Neto 1999), we choose the new terminology to emphasise 
that the task is quite different to querying: we are not trying to meet an explicit 
information need, but are rather attempting to accurately identify existing 
information relationships within the collection. 

¹ Although the concept of an ‘ideal’ underlying relationship graph is a useful artifice, 
the usual caveats of subjectivity and relativity must be borne in mind. 

To estimate the density of a relationship graph, we take a random selection of 
edges from the graph and judge whether the documents they connect are in fact 
co-derived. To estimate the coverage of a relationship graph, we select a number 
of representative documents and manually determine a list of documents with 
which they are co-derived. The coverage estimate is then the proportion of the 
manually determined pairings that are identified in the relationship graph. A 
third metric, average precision, is simply the average proportion of co-derivative 
edges to total edges for the documents selected to estimate coverage. While it is 
an inferior measure to average density, it plays a role in experimentation because 
it is far less time-consuming to calculate. 
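
Given a reference graph, coverage and density as defined in this section reduce to simple set operations. The sketch below assumes edges are represented as unordered pairs of document identifiers; the function names and the toy data are illustrative only.

```python
def edge(u, v):
    """Undirected edge as an order-independent pair."""
    return frozenset((u, v))


def coverage(graph_edges, reference_edges):
    """Proportion of reference-graph edges that the generated graph contains."""
    return len(graph_edges & reference_edges) / len(reference_edges)


def density(graph_edges, reference_edges):
    """Proportion of generated-graph edges that also appear in the reference graph."""
    return len(graph_edges & reference_edges) / len(graph_edges)


reference = {edge("a", "b"), edge("a", "c"), edge("b", "c"), edge("d", "e")}
generated = {edge("a", "b"), edge("b", "c"), edge("c", "f")}
cov = coverage(generated, reference)  # 2 of 4 reference edges found
den = density(generated, reference)   # 2 of 3 generated edges are correct
```

In practice the full reference graph is not available, which is why both quantities are estimated from samples as described above.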

7 Testing and Discussion 

Index Growth Rate. In order to investigate the growth trend of the shared-chunk 
index as the source collection grows, we extracted subcollections of various sizes 
from the LATimes collection and the linuxdocs collection, and observed the 
number of duplicate chunks extracted as the size of the collection grew. 

This growth trend is important for the scalability of SPEX and by extension 
the DECO package: if the growth trend were quadratic, for example, this would set 
a practical upper bound on the size of the collection which could be submitted 
to the algorithm, whereas if the trend were linear or n log n then far larger 
collections would become practical. We found that, for this collection at least, the 
growth rate follows a reasonably precise linear trend. For the LATimes collection, 
40 MB of data yielded 54,243 duplicate chunks; 80 MB yielded 126,542; 160 
MB 268,128; and 320 MB 570,580 duplicate chunks. While further testing is 
warranted, a linear growth trend suggests that the algorithm has potential to 
scale extremely well. 
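
One simple way to read the LATimes figures above is to look at how the number of duplicate chunks changes each time the subcollection doubles in size. The following Python fragment does this; the variable names are illustrative and the data are the four measurements quoted in the text.

```python
sizes_mb = [40, 80, 160, 320]
duplicate_chunks = [54_243, 126_542, 268_128, 570_580]

# Growth factor each time the subcollection doubles in size.
ratios = [b / a for a, b in zip(duplicate_chunks, duplicate_chunks[1:])]

# Duplicate chunks per megabyte at each size.
per_mb = [count / size for count, size in zip(duplicate_chunks, sizes_mb)]
```

A purely linear trend would give a factor of 2 per doubling and a quadratic trend a factor of 4. The observed factors are a little above 2, which is much closer to linear than to quadratic growth.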

Webdata+XML Experiments. Because the webdata+xml collection contains the 
nine seed documents for which we have exact knowledge of co-derivation re- 
lationships, it makes a convenient collection for proving the effectiveness of the 
DECO package and determining good parameter settings. Using DECO to create a 
shared-chunk index with a chunk size of eight took under one minute on an Intel 
Pentium 4 PC with 512 MB of RAM. For this collection, we tested DECO using 
the four scoring functions described in section 5. For each scoring function, we 
tested a range of five inclusion thresholds, named - in order of increasing value 
- T1 to T5; the values vary between the scoring functions and were chosen based 
on preliminary experiments. Each of the 20 generated relationship graphs was 
then tested for the presence of the 36 edges connecting the XML documents to 
each other. 

As can be seen in Table 1, the estimated coverage values strongly favour 
the lower inclusion thresholds. Indeed, for all scoring functions using the 
inclusion threshold T1, 100% of the pairings between the XML documents were 
included in the relationship graph. In all cases the average precision was also 
100%. These values - 100% coverage and 100% density - suggest a perfect 
result, but are certainly overestimates. The nature of the test collection - nine 
co-derived documents seeded into an entirely unrelated background collection 
- made it extremely unlikely that spurious edges would be identified. This not 
only introduced an artificially high density estimate but also strongly biased the 
experiments in favour of the lower inclusion thresholds, because they allowed all 
the correct edges to be included with very little risk that incorrect edges would 
likewise be admitted. 

Table 1. Coverage estimates, as percentages, for the webdata+xml collection calculated 
on the percentage of XML document pairings identified. The average precision was 
100% in all cases. 

          T1       T2       T3       T4       T5 
    S1   100.0     97.2     36.1      8.3      0.0 
    S2   100.0    100.0     83.3     58.3     25.0 
    S3   100.0     91.7     72.2     52.8     16.7 
    S4   100.0     97.2     91.7     58.3     22.2 

Table 2. Coverage and average precision estimates, as a pair X/Y of percentages, 
for DECO applied to the linuxdocs collection, using a full shared-chunk index and for 
indexes that store chunks only if their hash-key equals zero modulo 16 and 256. 

          T1        T2        T3        T4        T5 
    Full chunk indexing 
    S1   100/70     89/71     56/93     36/95     34/100 
    S2   100/57    100/75    100/92     89/94     57/100 
    S3    98/75     96/84     94/100    84/100    47/100 
    S4    99/83     96/91     94/100    78/100    30/100 
    Fingerprinting modulo 16 
    S1    90/72     88/76     56/94     36/96     34/100 
    S2    90/75     90/75     80/94     78/100    57/100 
    S3    88/82     86/91     74/100    74/100    47/100 
    S4    88/85     86/93     86/93     69/100    60/100 
    Fingerprinting modulo 256 
    S1    54/95     54/95     54/95     54/95     34/97 
    S2    54/97     54/97     54/97     54/97     44/97 
    S3    54/97     54/100    54/100    51/100    42/100 
    S4    54/97     54/100    54/100    44/100    31/100 

Experiments on the Linux Documentation Collection. For the linuxdocs 
collection, we used DECO to create a shared-chunk index with a chunk size of eight, 
taking approximately 30 minutes on an Intel Pentium 4 PC with 512 MB of 
RAM. For generation of relationship graphs we used the same range of scoring 
functions and inclusion thresholds as in the previous section. We wished also 
to investigate the level of deterioration witnessed in a fingerprinting strategy 
as the selectivity of the fingerprint increased; to this end, we experimented with 
relationship graphs generated from indexes generated using the modulo heuristic 
with k = 16 and k = 256. The inclusion thresholds for these experiments were 
adjusted downward commensurately. 

To estimate the coverage of the relationship graphs, we selected ten docu- 
ments from the collection representing a variety of different sizes and types, and 
manually collated a list of co-derivatives for each of these documents. This was 
done by searching for other documentation within the collection that referred to 
the same program or concept; thus, the lists may not be entirely comprehensive. 
Estimated coverage and average precision results for this set of experiments are 
given in Table 2. Several trends are observable in the results. The first of these 
is that in general, scoring functions S2, S3, and S4 were more effective than the 
simple chunk-counting S1 scoring function. Another trend is that performance 
is noticeably superior with the full shared-chunk index than with the selective 
shared-chunk indexes. Note in particular that, for the modulo 256 index, no con- 
figuration was able to find more than 54% of the relevant edges. This is almost 
certainly because the other 46% of document pairs do not have any chunks in 
common that evaluate to 0 modulo 256 when hashed. This illustrates the dangers 
of using lossy selection schemes when a high degree of reliability is desired. 
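
The loss described above can be illustrated with a minimal sketch of modulo-based chunk selection. The chunking into overlapping eight-word chunks, the use of CRC32 as the hash, and all function names are assumptions made for illustration; they are not taken from the deco implementation.

```python
import zlib

def word_chunks(text, size=8):
    """Return the set of overlapping word chunks of the given size."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def select_modulo(chunks, k):
    """Keep only chunks whose hash is 0 modulo k (k = 1 keeps every chunk)."""
    return {c for c in chunks if zlib.crc32(c.encode("utf-8")) % k == 0}

def shared_after_selection(doc_a, doc_b, k, size=8):
    """Chunks two documents still share once modulo-k selection is applied."""
    kept_a = select_modulo(word_chunks(doc_a, size), k)
    kept_b = select_modulo(word_chunks(doc_b, size), k)
    return kept_a & kept_b
```

With k = 1 the result equals the full set of shared chunks; as k grows the set can only shrink, so a pair of documents that shares just a few chunks may end up with no selected chunk in common and the corresponding edge is lost.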

We had insufficient human resources to complete an estimate of density for 
all of the relationship graphs generated. Instead, we selected a range of config- 
urations that seemed to work well and estimated the density for these configu- 
rations. This was done by picking 30 random edges from the relationship graph 
and manually assessing whether the two documents in question were co-derived. 
The results were pleasingly high: S2/T3/1, S3/T2/256, and S4/T3/16 all scored
a density of 93.3% (28 out of 30), while S4/T3/1 and S2/T1/16 both returned an
estimated density of 100%. Other combinations were not tested.

8 Conclusions 

There are many reasons why one may wish to discover co-derivation relationships 
amongst the documents in a collection. Previous feasible solutions to this task 
have been based on fingerprinting algorithms that used heuristic chunk selection 
techniques. We have argued that, with these techniques, one can have either 
reliability or acceptable resource usage, but not both at once. 

We have introduced the SPEX algorithm for efficiently identifying shared 
chunks in a collection. Unique chunks represent a large proportion of all chunks
in the collection (over 98% in one of the collections tested) but play no part in
discovery of co-derivatives. Identifying and discarding these chunks means that 
document fingerprints only contain data that is relevant to the co-derivative dis- 
covery process. In the case of the LATimes collection, this allows us to create 
an index that is functionally equivalent to full fingerprinting but is one fiftieth 
of the size of a full chunk index. Such savings allow us to implement a system 
that is effective and reliable yet requires only modest resources. 

Tests of our DECO system, which used the SPEX algorithm, on two test
collections demonstrated that the package is capable of reliably discovering
co-derivation relationships within a collection, and that introducing heuristic
chunk-selection strategies degraded reliability.

There is significant scope for further work and experimentation with DECO. 
One area of particular importance is the scalability of the algorithm. We have 
demonstrated that the system performs capably when presented with a highly 
redundant 700 MB collection and are confident that it can handle much larger 
collections, but this needs to be experimentally demonstrated. Another impor- 
tant further development is the design of an adjunct to the SPEX algorithm that 
would make it possible to add new documents to a collection without rebuilding 
the entire shared-chunk index. The difficulty of extending the index is the one 
major defect of SPEX compared to many other fingerprinting selection heuris- 
tics. However, the sensitivity, reliability and efficiency of SPEX make it already 
a valuable tool for analysis of document collections. 

We concentrate in this paper on multiple pattern matching, in which a set of
patterns $S = \{P_1, \ldots, P_k\}$, rather than a single one, is to be located in a given text
$T$. This problem has been treated in several works, including Aho and Corasick,
Commentz-Walter, Uratani and Takeda, and Crochemore et al. None of these
algorithms assumes any relationship between the individual patterns.
Nevertheless, there are many situations where the given strings are not necessarily
independent.

Consider, for example, a large full text information retrieval system, to which 
queries consisting of terms to be located are submitted. If a user wishes to 
retrieve information about computers, he might not want to restrict his query to 
this term alone, but include also grammatical variants and other related terms, 
such as under-computerized, recomputation, precompute, computability, 
etc. Using wild-cards, one could formulate this as *comput*, so that all the 
patterns to be searched for share some common substring. A similar situation 
arises in certain biological applications, where several genetic sequences have to 
be located in DNA strings, and these sequences may have considerable overlaps. 

The basic idea is the following: if we can find a substantial overlap s, shared 
by all the patterns in the set, it is for s that we start searching in the text, using 
any single pattern matching algorithm, for example BM. If no occurrence of s 
is found, none of the patterns appears and we are done. If s does occur t > 0 
times, it is only at its t locations that we have to check for the appearance of the 
set of prefixes of s in the set of patterns and of the corresponding set of suffixes. 
This can be done locally at the t positions where s has been found, e.g., with 
the AC algorithm, but with no need to use its fail function. 

More formally, let the set $S$ consist of patterns $P_i$, where $P_i = l_i s r_i$, and $l_i$
and $r_i$ are the (possibly empty) prefixes and suffixes of $P_i$ which are left after
removing the substring $s$. For our example, $\{l_i\}$ = {under-, re, pre, Λ}, where
Λ denotes the empty string, and $\{r_i\}$ = {erized, ation, e, ability}. Denote
also the length of $P_i$ by $m_i$ and the total length $\sum_{i=1}^{k} m_i$ by $M$. The algorithm
starts by identifying $s$, the longest common substring shared by $P_1, \ldots, P_k$. The
search algorithm is then given by:

Overlap_Matching(s, S)
    search for s in text T using KMP or BM
    for each i such that s is found starting at position i
        check at position i + |s| - 1 for an occurrence of an element of {r_j}
            using an AC automaton
        for each matching r_j found
            check if l_j matches T at position i - |l_j|
            if yes, declare match at i - |l_j|
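
The procedure above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: Python's built-in `str.find` stands in for KMP or BM, direct string comparisons stand in for the AC automaton, positions are 0-based, and the function name `overlap_matching` is hypothetical.

```python
def overlap_matching(text, patterns, s):
    """Return sorted (position, pattern) pairs for every pattern occurrence.

    Every pattern must contain the non-empty overlap s.
    """
    if not s:
        raise ValueError("the overlap s must be non-empty")
    # Split each pattern P_i into l_i + s + r_i around its first occurrence of s.
    parts = []
    for p in patterns:
        k = p.find(s)
        if k < 0:
            raise ValueError(f"pattern {p!r} does not contain {s!r}")
        parts.append((p[:k], p[k + len(s):], p))
    matches = []
    i = text.find(s)
    while i >= 0:
        for left, right, p in parts:
            start = i - len(left)
            # Verify the suffix r_j after s, then the prefix l_j before s.
            if (start >= 0 and text.startswith(right, i + len(s))
                    and text.startswith(left, start)):
                matches.append((start, p))
        i = text.find(s, i + 1)  # step by one so overlapping occurrences of s are kept
    return sorted(matches)
```

Only the positions where `s` occurs are ever inspected, which is the source of the savings when `s` is long and therefore rare in the text.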



A. Apostolico and M. Melucci (Eds.): SPIRE 2004, LNCS 3246, pp. 68—69, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
The dominant part of the time complexity will generally be for the search of
$s$, which can be done in time $O(|T|)$. Applying the AC automaton is not really a
search, since it is done at well-known positions. Its time complexity is therefore
bounded by $O(t \cdot \max\{m_i\})$, where $t$ is the number of occurrences of the overlap
$s$. Only in case the occurrences of $s$ are so frequent that the potential positions
of the patterns cover a large part of the string $T$ may $O(t \cdot \max\{m_i\})$ be larger
than $O(|T|)$, but this will rarely occur for a natural language input string.

Consider now as an example the dictionary $S$ = {$P_1$ = dxyz, $P_2$ = wdxyza, $P_3$ =
bcdxyzw, $P_4$ = bcdzw}, showing a major deficiency of the suggested algorithm.
The longest substring shared by all the patterns is just a single character, d or 
z. One can of course circumvent the case, which is even worse from our point 
of view, when the longest common substring is empty, by invoking then the 
standard AC routine. But if there is a non-empty string s, but it is too short, it 
might occur so often that the benefit of our procedure could be lost. 

In the above example, the string dxyz is shared by only three of the four 
elements of the dictionary S, but its length is much longer than the string shared 
by all the elements. This suggests that it might be worthwhile not to insist on 
having the overlap cover the entire dictionary, but maybe to settle for one shared 
not by all, but at least by a high percentage of the patterns, which may allow 
us to choose a longer overlap. An alternative approach could be to look for
two or more substrings $s_1, s_2, \ldots$, each longer than $s$ and each being shared
by the patterns of some proper subset of $S$, but which together cover the entire
dictionary. For our example we could, e.g., use the pair of substrings {dxyz, bcd}.
In Information Retrieval applications, such an approach could be profitable in
case the query consists of the grammatical variants of two or more terms, or in
case of irregularities, as in the set {go, goes, undergoing, went}.
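
The effect of relaxing the requirement that the overlap cover the whole dictionary can be checked with a small brute-force sketch. The function name is hypothetical and the quadratic enumeration is chosen for clarity only; a suffix-tree based method would be preferable for long patterns.

```python
def longest_common_substring(patterns):
    """Return one longest substring shared by all patterns ('' if there is none)."""
    if not patterns:
        return ""
    shortest = min(patterns, key=len)
    # Try candidate substrings of the shortest pattern, longest first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            cand = shortest[start:start + length]
            if all(cand in p for p in patterns):
                return cand
    return ""

dictionary = ["dxyz", "wdxyza", "bcdxyzw", "bcdzw"]
print(longest_common_substring(dictionary))      # d
print(longest_common_substring(dictionary[:3]))  # dxyz
```

Dropping the single pattern bcdzw lengthens the usable overlap from one character to four.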

To get a general feeling of how the algorithms behave in some real-life appli- 
cations, we ran a set of tests on several text and DNA files. As a measure for 
the efficiency, we defined a rate as the number of symbol comparisons divided 
by the length of the text. The Aho-Corasick algorithm served as benchmark,
yielding always a rate of 1. The graphs in Figure 1 give the rate, for each of the 
algorithms we considered, as a function of the overlap size. 




Fig. 1. Comparative performance for text and DNA files 

As can be seen, the rate is strictly decreasing with the overlap size, and rates 
as low as 0.4 on the average can be reached already for relatively short overlaps 
of size 4-5. 



Linear Nondeterministic Dawg String 
Matching Algorithm (Abstract) 



Longtao He and Binxing Fang 

Research Center of Computer Network and Information Security Technology 
Harbin Institute of Technology, Harbin 150001, P.R. China 

The exact string matching problem is to find all the occurrences of a given
pattern $x = x_1 x_2 \cdots x_m$ in a large text $y = y_1 y_2 \cdots y_n$, where both $x$ and $y$ are
sequences of symbols drawn from a finite character set $\Sigma$ of size $\sigma$.

Various good solutions have been presented over the years. Among them,
BNDM [^1 is a very efficient and flexible algorithm. It simulates the BDM algo- 
rithm using bit-parallelism. BNDM first builds a mask table B for each symbol 
c. The mask in $B[c]$ has the $i$-th bit set if and only if $x_i = c$. The search state is
kept in a computer word $L = L_m \cdots L_1$, where the bit $L_i$ at iteration $l$ is set if
and only if $x_i \cdots x_{i+l-1} = y_{j-l+1} \cdots y_j$, where $j$ is the end position of the current
window. Each time we position the window in the text we initialize $L = 1^m$ and
scan the window backward. For each new text character we update $L$ with the
formula:

$$L \leftarrow (L \mathbin{\&} B[y_{j-l}]) \gg 1 \qquad (1)$$

Each time we find a prefix of the pattern ($L_1 = 1$) we remember the position in
the window. If we run out of 1's in $L$, there cannot be a match and we suspend the
scanning and then shift the window to the right. If we can perform $m$ iterations
then we report a match.

BNDM uses only one computer word to keep the search state. The over- 
flow bits in the shift right of the search state are lost. This is why BNDM has 
a quadratic worst case time complexity. We present a new purely bit-parallel 
variant of BNDM, which we call the Linear Nondeterministic Dawg Matching 
algorithm (LNDM). LNDM makes use of two computer words L and R. L is 
the traditional state. The additional computer word R keeps the overflow bits 
in the shift right of the search state L during the backward scan. The formulas 
to update the search state are changed to: 

$$L \leftarrow L \mathbin{\&} B[y_{j-l}] \qquad (2)$$

$$(LR) \leftarrow (LR) \gg 1 \qquad (3)$$

where $(LR)$ means the concatenation of $L$ and $R$. Instead of checking if $L_1 = 1$,
LNDM just right-shifts $(LR)$. The scan goes on until $L$ is equal to $0^m$. Then, if
$R \neq 0^m$, we resume a forward scan after the end of the current window with a
nondeterministic automaton initialized by the saved bits $R \gg (m - l)$.

In the additional forward scan stage, LNDM runs as a reverse Backward 
Nondeterministic Dawg, to be precise, a Forward Nondeterministic Dawg. For 
each new text symbol we update R with the formula: 

$$R \leftarrow (R \ll 1) \mathbin{\&} B[y_{j+r}] \qquad (4)$$

LNDM(x = x_1 x_2 ... x_m, y = y_1 y_2 ... y_n)
    Preprocessing
        For c ∈ Σ do B[c] ← 0^m
        For i ∈ 1 ... m do
            B[x_i] ← B[x_i] | 0^(m-i) 1 0^(i-1)
    Search
        For k ∈ 1 ... ⌊n/m⌋ do
            l ← 0, r ← 0
            L ← 1^m, R ← 0^m
            While L ≠ 0^m do
                L ← L & B[y_(km-l)]
                l ← l + 1
                (LR) ← (LR) >> 1
            End of while
            R ← R >> (m - l)
            While R ≠ 0^m do
                r ← r + 1
                If R & 1 0^(m-1) ≠ 0^m then
                    output km + r - m
                End of if
                R ← (R << 1) & B[y_(km+r)]
            End of while
        End of for

Fig. 1. The Pseudo-code of LNDM.

Fig. 2. Running time for alphabet size 256.



Since LNDM scans in the direction opposite to BNDM in this stage, the
formula differs from that of BNDM in the shift direction. Each time we meet
the situation $R_m = 1$ we report a match. If we run out of 1's in $R$, there cannot
be a match and the algorithm shifts to the next window.

The algorithm is summarized in Fig. 1. With this approach, LNDM can safely
shift by a fixed $m$ after each attempt. This important improvement enables
LNDM to have optimal time complexities in the worst, best and
average cases, respectively: $O(n)$, $O(n/m)$ and $O(n (\log_\sigma m)/m)$, for patterns no longer
than a computer word.
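
The two scanning stages can be made concrete with the following Python sketch. It is an illustration of the scheme described in the text rather than the authors' implementation: Python integers replace fixed-width computer words, so the pattern-length restriction does not apply, the function name `lndm_search` is hypothetical, and reported positions are 1-based as in the pseudo-code.

```python
def lndm_search(x, y):
    """Return the 1-based start positions of pattern x in text y."""
    m, n = len(x), len(y)
    if m == 0 or n < m:
        return []
    full = (1 << m) - 1      # the word 1^m
    top = 1 << (m - 1)       # bit m, which signals a complete match
    B = {}
    for i, c in enumerate(x, start=1):
        B[c] = B.get(c, 0) | (1 << (i - 1))  # bit i is set iff x_i == c
    out = []
    for k in range(1, n // m + 1):
        end = k * m          # 1-based end position of the k-th window
        L, R, l = full, 0, 0
        # Backward scan: overflow bits of L are saved in the top of R.
        while L:
            L &= B.get(y[end - l - 1], 0)
            l += 1
            R = (R >> 1) | ((L & 1) << (m - 1))  # shift the pair (L R) right by one
            L >>= 1
        R >>= m - l          # bit p now marks a pattern prefix of length p
        # Forward scan: extend the saved prefixes past the window.
        r = 0
        while R:
            r += 1
            if R & top:
                out.append(end + r - m)
            pos = end + r
            if pos > n:
                break
            R = ((R << 1) & full) & B.get(y[pos - 1], 0)
    return out
```

Every occurrence starts inside exactly one window, so it is reported once, by the window that contains its first character; the window always advances by exactly `m`.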

We compared the following algorithms with LNDM: BM, QS, BDM, Turbo-BDM,
BNDM, Turbo-BNDM, and SBNDM. Fig. 2 shows the experimental results
over alphabet size 256. The results over large alphabets are similar. The x axis
is the length of the patterns, and the y axis shows the average running time
in seconds per pattern per MB of text for each algorithm. The results show that
LNDM is very fast in practice for large alphabets. Among the worst-case linear
time algorithms, LNDM is the most efficient one.

The goal of scaled permuted string matching is to find all occurrences of a pattern
in a text, in all possible scales and permutations. Given a text of length $n$ and
a pattern of length $m$ we present an $O(n)$ algorithm.

Definition 1 Scaled permuted string matching

Input: A pattern $P = p_1 \cdots p_m$ and a text $T = t_1 \cdots t_n$, both over alphabet $\Sigma$.
Output: All positions in $T$ where an occurrence of a permuted copy of the
pattern $P$, scaled to $k$, starts ($k = 1, \ldots, \lfloor n/m \rfloor$). The pattern is first permuted
and then scaled.

Example: The string bbbbaabbaaccaacc is a scaled (to 2) permutation of
baabbacc.
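
The definition can be tested directly on a candidate string with the following sketch. It relies on the observation that a string is a permuted copy of the pattern scaled to k exactly when every maximal run has a length divisible by k and the character counts, divided by k, equal those of the pattern. The function name is hypothetical, and the check concerns a single candidate string, not the search over a text.

```python
from collections import Counter
from itertools import groupby

def is_scaled_permutation(candidate, pattern, k):
    """True if candidate is some permutation of pattern with each symbol repeated k times."""
    if k < 1 or len(candidate) != k * len(pattern):
        return False
    counts = Counter()
    for ch, group in groupby(candidate):
        run = len(list(group))
        if run % k:          # a run that cannot be cut into blocks of k equal symbols
            return False
        counts[ch] += run // k
    return counts == Counter(pattern)

print(is_scaled_permutation("bbbbaabbaaccaacc", "baabbacc", 2))  # True
```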

1 Permuted String Matching 

over Run-Length Encoded Text 

The permuted string matching problem over uncompressed text is simply solved.
A sliding window of size $|P|$ can be moved over $T$ to count, for each location
of $T$, the order of statistics of the characters. Obviously, this can be done in
$O(n)$ time. Let $T'$ be the run-length compressed version of $T$, where $T' =
t_1^{j_1} \cdots t_{|T'|}^{j_{|T'|}}$. Similarly, $P'$ is the permuted run-length compressed pattern. The
pattern can be permuted, and therefore, in each location of the text we check if
the order of statistics of the characters is equal to that of the pattern. As a result,
a better compression can be achieved: symbols with the same character are
compressed together. For example, let $P$ = aabbbaccaab; its run-length compressed version
is $a^2b^3a^1c^2a^2b^1$ and its permuted run-length compressed version is $P' = a^5b^4c^2$.
The technique we use is similar to the sliding window technique: a window is
shifted on $T'$ from left to right in order to locate all the matches. The window
is a substring of $T'$ that represents a candidate for a match. Unlike the simple
algorithm, this time the window size is not fixed.
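
For reference, the simple fixed-size sliding window over uncompressed text mentioned at the start of this section might look as follows. The function name is hypothetical and positions are 0-based; comparing the two counters costs time proportional to the number of distinct characters, so this sketch is linear only for a constant-size alphabet.

```python
from collections import Counter

def permuted_matches(text, pattern):
    """Return the 0-based positions where a permutation of pattern occurs in text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    need = Counter(pattern)
    window = Counter(text[:m])
    out = [0] if window == need else []
    for i in range(m, len(text)):
        window[text[i]] += 1          # character entering on the right
        left = text[i - m]
        window[left] -= 1             # character leaving on the left
        if window[left] == 0:
            del window[left]          # keep the counter free of zero entries
        if window == need:
            out.append(i - m + 1)
    return out

print(permuted_matches("cbaebabacd", "abc"))  # [0, 6]
```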

We will define a valid window as a substring of $T'$ that fulfills the following
two properties: sufficient: the number of times each character appears in
the window is at least the number of times it appears in the pattern; minimal:
removing the rightmost or the leftmost symbol of the window violates the
sufficient property.

* Partially supported by NSF grant CCR-0104307, by the Israel Science Foundation
grant 282/01, and by IBM Faculty Award.

A full version of the paper appears in http://cs.haifa.ac.il/LANDAU/public.html

The algorithm scans the text, locates all valid windows and finds the ones in 
which a permuted copy of the pattern occurs. During the scan of the text, given a 
valid window, it is trivial to check if it contains a match. Hence, we will describe 
only how to locate all valid windows. The valid windows are found by scanning 
the text from left to right, using two pointers, left and right. To discover each 
valid window, the right pointer moves first to find a sufficient window and then 
the left pointer moves to find the valid window within the sufficient window. 
The right pointer moves as long as deleting the leftmost symbol of the window 
violates the sufficient property of the window. When this symbol can finally 
be removed, the right pointer stops and the left pointer starts moving. The left 
pointer moves as long as deleting the leftmost symbol of the window does not 
violate the sufficient property of the window. At this point, a new valid window 
has been found. 
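
The two-pointer scan for valid windows can be sketched as below. The representation of the run-length encoded text as a list of (character, run length) pairs, the dictionary `need` of pattern counts, and the function name are assumptions for illustration; the step that checks whether a valid window actually contains a match is omitted.

```python
from collections import Counter

def valid_windows(runs, need):
    """runs: list of (char, run_length); need: char -> count required by the pattern.

    Returns the (left, right) index pairs of all valid windows, left to right.
    """
    short = sum(1 for k in need.values() if k > 0)  # characters still under-represented
    if short == 0:
        return []
    have, out, left, found_any = Counter(), [], 0, False
    for right, (ch, n) in enumerate(runs):
        before = have[ch]
        have[ch] += n
        if before < need.get(ch, 0) <= have[ch]:
            short -= 1
        if short:
            continue  # not sufficient yet: keep moving the right pointer
        moved = False
        # Move the left pointer while dropping the leftmost run keeps the window sufficient.
        while have[runs[left][0]] - runs[left][1] >= need.get(runs[left][0], 0):
            have[runs[left][0]] -= runs[left][1]
            left += 1
            moved = True
        if moved or not found_any:
            out.append((left, right))
            found_any = True
    return out
```

Each pointer only moves forward, so the scan visits every run a constant number of times.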

Example: Let P' = afb^c^df and T' = c^afc^af'd’^h^c^ then c^afc^af'dfb^ is the 
first sufficient window, and c^a^dfb^ is the first valid window (but not a match). 
Time Complexity: We assume that $|\Sigma|$ is $O(|P'|)$; hence, the time complexity
of the algorithm is $O(|P'| + |T'|)$.

2 A Linear Time Algorithm 

for the Scaled Permuted String Matching Problem 

The algorithm is composed of two stages: 1. Preprocessing the text $T'$: computing
compact copies of the text for each possible scale $1 \le s \le \lfloor n/m \rfloor$. 2. Applying the
permuted string matching over run-length encoded text algorithm (Section
1) to the copies of the text.

We observe that if a permutation of $P$ scaled to $s$ occurs in $t_i^{j_i} \cdots t_k^{j_k}$, then
$j_{i+1}, \ldots, j_{k-1}$ are multiples of $s$, and $j_i, j_k \ge s$. Hence, we compute for each

scale s a compact text in the following two steps: (In order to simplify the 
computation of Stage 2, a symbol tffi^ of T' is replaced in by tj^^^ .) 

Step 1. Locating the regions — T' is scanned from left to right. Consider a 
symbol tAb A new symbol ti^ is added to if is a multiple of s. It may 
continue a region or start a new one, in the second case we add a separator ($) 
between the regions. 

Step 2. Expansion of the regions — The last refinement is done by scanning 
each text from left to right and expanding all the regions we generated in 
step 1. 

Example: Let T' = a^b'^c'^af'df'b^dfc^b'^af, the new text after applying step 2 
is: T{ =$ a^b'^c^a^df‘b^dfc^b‘^aJ% , = %a^b^c^a^%b‘^dfc^b'^a^%, Tg = 

%d^b^%, Ti = %c^%cffi^a^%, = $ 0 ^$, 

= %b^% 

Stage 2 runs the permuted string matching over a run-length encoded text 
algorithm (section 1) on all the new compact texts. 

Time Complexity: The running time of both Stage 1 and Stage 2 is bounded
by the total length ($O(n)$) of the new texts; therefore, the total time complexity
is $O(n)$.

Main Results. We consider the problem of longest common subsequence (LCS)
of two given strings in the case where the first may be shifted by some constant
(i.e. transposed) to match the second. For this longest common transposition
invariant subsequence (LCTS) problem, which has applications for instance in
music comparison, we develop a branch and bound algorithm with best case
time $O((m^2 + \log\log\sigma)\log\sigma)$ and worst case time $O((m^2 + \log\sigma)\sigma)$, where $m$
and $\sigma$ are the length of the strings and the number of possible transpositions,
respectively. This compares favorably against the $O(\sigma m^2)$ naive algorithm in
most cases and, for large $m$, against the $O(m^2 \log\log m)$ time algorithm of [2].

Technical Details. Let $A = a_1, \ldots, a_n$ and $B = b_1, \ldots, b_m$ be two strings over a
finite numeric alphabet $\Sigma = \{0 \ldots \sigma\}$. A subsequence of string $A$ is obtained by
deleting zero, one or several characters of $A$. The length of the longest common
subsequence of $A$ and $B$, denoted $LCS(A, B)$, is the length of the longest string
that is a subsequence both of $A$ and $B$.

The conventional dynamic programming approach computes $LCS(A, B)$ in
time $O(mn)$, using a well-known recurrence that can be easily adapted to compute
$LCS(A + c, B)$, where $A + c = (a_1 + c), \ldots, (a_n + c)$, for some transposition
$c$, where $-\sigma \le c \le \sigma$:

$$LCS^c_{i,0} = 0; \quad LCS^c_{0,j} = 0;$$

$$LCS^c_{i,j} = \text{if } a_i + c = b_j \text{ then } 1 + LCS^c_{i-1,j-1} \text{ else } \max(LCS^c_{i-1,j}, LCS^c_{i,j-1}).$$

Our goal is to compute the length of the longest common transposition
invariant subsequence,

$$LCTS(A, B) = \max_{c \in -\sigma \ldots \sigma} LCS^c(A, B).$$
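
As a baseline, the recurrence and the maximization over all transpositions translate into the following naive sketch, which runs in time proportional to $\sigma m n$. The function names are hypothetical, and the strings are assumed to be given as lists of integers.

```python
def lcs_transposed(A, B, c):
    """Length of the LCS of A + c and B (a_i matches b_j when a_i + c == b_j)."""
    prev = [0] * (len(B) + 1)          # previous row of the dynamic programming table
    for a in A:
        cur = [0]
        for j, b in enumerate(B, start=1):
            if a + c == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcts_naive(A, B, sigma):
    """Maximum of LCS^c(A, B) over all transpositions -sigma <= c <= sigma."""
    return max(lcs_transposed(A, B, c) for c in range(-sigma, sigma + 1))

# A short melody and the same melody transposed up by 7 (values in 0..127).
print(lcts_naive([60, 62, 64, 60], [67, 69, 71, 67], 127))  # 4
```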

Let $X$ denote a subset of transpositions and $LCS^X(A, B)$ be such that
$a_{i+1}$ and $b_{j+1}$ match whenever $b_{j+1} - a_{i+1} \in X$. Now, it is easy to see that
$LCS^X(A, B) \ge \max_{c \in X} LCS^c(A, B)$, so $LCS^X(A, B)$ may not contain the
actual maximum $LCS^c(A, B)$ for $c \in X$ but gives an upper bound. Our aim is to
find the maximum $LCS^c(A, B)$ value by successive approximations.

We form a binary tree whose nodes have the form $[I, J]$ and represent the
range of transpositions $X = \{I \ldots J\}$. The root is $[-\sigma, \sigma]$. The leaves have the
form $[c, c]$. Every internal node $[I, J]$ has two children $[I, \lfloor (I + J)/2 \rfloor]$ and
$[\lfloor (I + J)/2 \rfloor + 1, J]$.

The hierarchy is used to upper bound the $LCS^c(A, B)$ values. For every node
$[I, J]$ of the tree, if we compute $LCS^{\{I \ldots J\}}(A, B)$, the result is an upper bound
to $LCS^c(A, B)$ for any $I \le c \le J$. Moreover, $LCS^X(A, B)$ is easily computed in
$O(mn)$ time if $X = \{I \ldots J\}$ is a continuous range of values:

$$LCS^X_{i,0} = 0; \quad LCS^X_{0,j} = 0;$$

$$LCS^X_{i,j} = \text{if } b_j - a_i \in X \text{ then } 1 + LCS^X_{i-1,j-1} \text{ else } \max(LCS^X_{i-1,j}, LCS^X_{i,j-1}).$$

We already know that the $LCS^X$ value of the root is $\min(m, n)$, since every
pair of characters matches. The idea is now to compute its two children, and
continue with the most promising one (higher $LCS^X$ upper bound). For this
most promising one, we compute its two children, and so on. At any moment,
we have a set of subtrees to consider, each one with its own upper bound on the
leaves it contains. At every step of the algorithm, we take the most promising
subtree, compute its two children, and add them to the set of subtrees under
consideration. If the most promising subtree turns out to be a leaf node $[c, c]$,
then the upper bound value is indeed the exact $LCS^c$ value. At this point we
can stop the process, because all the upper bounds of the remaining subtrees are
smaller than or equal to the actual $LCS^c$ value we have obtained. So we are sure
of having obtained the highest value.
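
A compact sketch of this best-first search, using Python's `heapq` module as the priority queue, is given below. The function names are hypothetical, the strings are assumed to be lists of integers from $\{0 \ldots \sigma\}$, and the bit-parallel refinements are not included.

```python
import heapq

def lcs_range(A, B, lo, hi):
    """LCS where a_i and b_j match whenever lo <= b_j - a_i <= hi."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0]
        for j, b in enumerate(B, start=1):
            if lo <= b - a <= hi:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcts_branch_and_bound(A, B, sigma):
    """Return (LCTS length, a transposition c achieving it)."""
    # heapq is a min-heap, so upper bounds are stored negated.
    # At the root every pair of characters matches, so the bound is min(|A|, |B|).
    heap = [(-min(len(A), len(B)), -sigma, sigma)]
    while heap:
        neg_bound, lo, hi = heapq.heappop(heap)
        if lo == hi:
            # A leaf holds an exact value that no remaining upper bound exceeds.
            return -neg_bound, lo
        mid = (lo + hi) // 2   # floor division, also for negative values
        for l, h in ((lo, mid), (mid + 1, hi)):
            heapq.heappush(heap, (-lcs_range(A, B, l, h), l, h))

print(lcts_branch_and_bound([1, 2, 3], [3, 4, 5], 5))  # (3, 2)
```

The search stops as soon as a leaf reaches the top of the queue, so ranges whose upper bound is already too small are never refined.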

For the analysis, we have a best case of $\log_2(2\sigma + 1) = O(\log\sigma)$ iterations
and a worst case of $2(2\sigma + 1) - 1 = 4\sigma + 1 = O(\sigma)$ until we obtain the first
leaf element. Our priority queue, which performs operations in logarithmic time,
contains $O(\log\sigma)$ elements in the best case and $O(\sigma)$ in the worst case. Hence
every iteration of the algorithm takes $O(m^2 + \log\log\sigma)$ at best and $O(m^2 + \log\sigma)$
at worst. This gives an overall best case complexity of $O((m^2 + \log\log\sigma)\log\sigma)$
and $O((m^2 + \log\sigma)\sigma)$ for the worst case. The worst case is not worse than the
naive algorithm for $m = \Omega(\sqrt{\log\sigma})$, which is the case in practice.

By using bit-parallel techniques that perform several $LCS^X$ computations
at the same time [1], the algorithm can be extended to use a $t$-ary tree.

This technique can be applied also to any distance $d$ satisfying
$\min_{c \in X} d^c(A, B) \ge d^X(A, B)$, where $d^X(A, B)$ is computed by considering that
$a_{i+1}$ and $b_{j+1}$ match whenever $b_{j+1} - a_{i+1} \in X$. This includes d-LCS, general
weighted edit distance, polyphony, etc., so it enjoys more generality than most of
the previous approaches. It cannot, however, be easily converted into a search algorithm.

Abstract. We propose a new feature normalization scheme based on eigenspace
for achieving robust speech recognition. In particular, we employ
Mean and Variance Normalization (MVN) in eigenspace, using unique and
independent eigenspaces for cepstra, delta and delta-delta cepstra respectively.
We also normalize training data in eigenspace and get the model from the
normalized training data. In addition, a feature space rotation procedure is
introduced to reduce the mismatch of training and test data distribution in noisy
conditions. As a result, we obtain a substantial recognition improvement over the
basic eigenspace normalization.



1 Proposed Scheme 



We separated the feature vector into three classes (cepstra, delta and delta-delta
cepstra) because each class has its own definition and characteristics. Then we
implemented a separated-eigenspace normalization (SEN) scheme.

When cepstral features are distorted by noisy conditions, their distribution can be
moved as well as rotated by some amount from their original distribution [2]. When
we rotate only the dominant eigenvector, which has the largest variance or eigenvalue,
the first eigenvectors of training and test features become the same and the mismatch
between the training and test data distributions can be reduced. Only the first
eigenvector rotation procedure is presented here, as follows. First, we need to
obtain the eigenvalues and eigenvectors of the full training corpus. $v$ denotes the first
dominant eigenvector of the training distribution and $\hat{v}$ denotes the first dominant
eigenvector of one test utterance. Then the rotation angle $\alpha$ between the two
eigenvectors is computed from their dot product, $\alpha = \arccos(v \cdot \hat{v})$, and

$$R = \begin{pmatrix} \cos(\alpha) & \sin(\alpha) \\ -\sin(\alpha) & \cos(\alpha) \end{pmatrix},$$

where $R$ denotes a rotation matrix. Since the two eigenvectors are not orthogonal,
the Gram-Schmidt procedure is applied to $\hat{v}$ in order to obtain the orthonormal
basis vector $\tilde{v}$ lying in the same plane of rotation:

$$\tilde{v} = \frac{\hat{v} - (v \cdot \hat{v}) \, v}{\| \hat{v} - (v \cdot \hat{v}) \, v \|}.$$

Then we project the test features onto the plane spanned by $v$ and $\tilde{v}$. The projection
matrix consists of $v$ and $\tilde{v}$, thus $J = (v, \tilde{v})$. Finally, a correction matrix $I - JJ^T$
with the identity matrix $I$ has to be applied in order to restore the dimensions lost in
the projection procedure. Then the full rotation matrix $Q$ is derived as

$$Q = JRJ^T + I - JJ^T.$$

Finally, the rotated feature is obtained by $\tilde{x} = Qx$.
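
The effect of the rotation can be checked numerically with a small pure-Python sketch. It applies $Q = JRJ^T + I - JJ^T$ to a feature vector without forming $Q$ explicitly: only the two coordinates in the plane of rotation are changed. The function name and argument names are hypothetical, and both eigenvectors are assumed to have unit length.

```python
import math

def rotate_feature(x, v_train, v_test):
    """Apply Q = J R J^T + I - J J^T to x, where J has columns v_train and v_tilde."""
    def dot(p, q):
        return sum(pi * qi for pi, qi in zip(p, q))

    c = dot(v_train, v_test)                       # cos(alpha)
    # Gram-Schmidt: remove from v_test its component along v_train, then normalize.
    resid = [t - c * s for t, s in zip(v_test, v_train)]
    norm = math.sqrt(dot(resid, resid))
    if norm == 0.0:
        return list(x)                             # parallel eigenvectors: no rotation plane
    v_tilde = [r / norm for r in resid]
    alpha = math.acos(max(-1.0, min(1.0, c)))
    a, b = dot(v_train, x), dot(v_tilde, x)        # coordinates of x in the plane (J^T x)
    a_rot = math.cos(alpha) * a + math.sin(alpha) * b
    b_rot = -math.sin(alpha) * a + math.cos(alpha) * b
    # Replace the in-plane part of x by its rotated version; the rest is untouched.
    return [xi + (a_rot - a) * s + (b_rot - b) * t
            for xi, s, t in zip(x, v_train, v_tilde)]
```

With these conventions the test eigenvector itself is mapped onto the training eigenvector, while any component orthogonal to the plane of rotation is left unchanged.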

2 Experiments and Results 

Recognition Task: The feature normalization method has been tested with the
Aurora2.0 database, which contains English connected digits recorded in clean
environments. Three sets of sentences under several conditions (e.g. SetA: subway, car noise,
SetB: restaurant, street and train station noise, SetC: subway and street noise) were
prepared by contaminating them with noise at SNRs ranging from -5dB to 20dB, in
addition to the clean condition. A total of 1001 sentences are included in each noise condition.
Experimental Procedure and Results: We followed the Aurora2.0 evaluation
procedure for performance verification, along with the identical conditions suggested in the
Aurora2.0 procedure. Note that we use the c0 coefficient instead of log-energy to
obtain improved performance, because eigenspace is defined consistently when some
of the elements have large variance. First we examine the baseline performance (clean
condition training). We then apply MVN [3] and the eigenspace MVN to only the
test data and to both training and test data together. Next, we experimented with
separated-eigenspace normalization (SEN). The feature space rotation with SEN was
also examined. The notations used in the table are as follows: 1) MVN: mean
and variance normalization in the cepstral domain, 2) EIG: mean and variance
normalization in eigenspace [1] (eigenspace normalization), 3) SEN: separated-eigenspace
normalization, 4) SEN_Ro_20: separated-eigenspace normalization + feature space rotation.
The first eigenvector for the test data is obtained from the 20dB data of each noisy training set.

From Table 1, we can see that SEN with feature rotation and training data
normalization is more effective than basic eigenspace normalization.

We initially expected the best performance when each dominant eigenvector
obtained from each SNR was applied to the corresponding SNR test set. However, it
turns out that such a method does not guarantee an improvement. At low SNR, the
performance becomes slightly degraded. We achieved the best performance when
applying an eigenvector of the 20dB set to all SNR data of the same test set.

Table 1. Average word accuracy (%) for the proposed scheme on all data sets in Aurora2.0
(_T denotes the normalization of training data)

        Baseline   MVN     EIG     EIG_T   SEN     SEN_T   SEN_Ro_20_T
SetA    59.58      77.90   79.81   80.43   80.27   80.51   81.08
SetB    57.18      79.49   81.21   82.87   81.77   82.49   -
SetC    66.81      77.90   78.96   79.23   79.32   79.10   -


At lower SNR, the data distribution in the cepstral domain becomes more compressed.
Consequently, its discriminative shape (e.g. large variance) is diminished as the
SNR becomes lower. This is why the 20dB statistics yielded the best performance.
From a 20dB noisy training database, we estimated the characteristics of the
corresponding noise and compensated for the feature reliably. Through the proposed
methods, we obtained an average word accuracy of up to 81.08% on SetA of Aurora2.0.

Abstract. We developed a dynamic programming approach for computing
common sequence/structure patterns between two RNAs given
by their sequences and secondary structures. Common patterns between
two RNAs are meant to share the same local sequential and structural
properties. Nucleotides which are part of an RNA are linked together
by their phosphodiester or hydrogen bonds. These bonds describe
how nucleotides are involved in patterns and thus deliver a
bond-preserving matching definition. Based on this definition, we are
able to compute all patterns between two RNAs in time $O(nm)$ and
space $O(nm)$, where $n$ and $m$ are the lengths of the RNAs, respectively.
Our method is useful for describing and detecting local motifs and for
detecting local regions of large RNAs although they do not share global
similarities. An implementation is available in C++ and can be obtained
by contacting one of the authors.



1 Introduction 

RNAs are polymers consisting of the four nucleotides A, C, G and U, which are linked
together by their phosphodiester bonds. This chain of nucleotides is called the
primary structure. Bases which are part of the nucleotides form hydrogen bonds within
the same molecule, leading to structure formation. One major challenge is to find
(nearly) common patterns in RNAs since they suggest functional similarities of these
molecules. For this purpose, one has to investigate not only sequential features, but
also structural features. The structure in combination with the sequence of a molecule
dictates its function. Finding common RNA motifs is currently a hot topic in
bioinformatics since RNA has been identified as one of the most important research
topics in life sciences. RNA was selected as the scientific breakthrough of the year 2002
by the readers of the journal Science.




Fig. 1. Structure elements of an RNA 
secondary structure. 

Most approaches to finding RNA sequence/structure motifs are based on (locally)
aligning two RNAs of length $n$. They use dynamic programming methods
with a high complexity between $O(n^4)$ and $O(n^6)$ ([1], [9]). Hence, these
approaches are suited only to RNAs of moderate size. For that reason, we
want to use a general approach that is inspired by the DIALIGN [10] method for
multiple sequence alignments. The basic idea is to find exact patterns in large
RNAs first, and then to locally align only subsequences containing many exact
patterns by using a more complex approach like [1].

So far, the problem of finding local, exact common sequence/structure patterns
was unsolved. This is the problem considered in this paper. We can
list all patterns between two RNAs in time $O(nm)$ and space $O(nm)$, where $n$
and $m$ are the lengths of the RNAs, respectively. The key idea is a dynamic
programming method that describes secondary structures not only as base pairing
interactions but at a higher level of structure elements known as hairpin loops,
right bulges, left bulges, internal loops or multi-branched loops (see Figure 1).
The computation of RNA patterns is performed on loop regions from inside to
outside. Base-pairs which enclose loops occur in a nested fashion, i.e. any two
nested base-pairs $(i_1, i_2)$ and $(j_1, j_2)$ fulfill either $i_1 < i_2 < j_1 < j_2$
or $i_1 < j_1 < j_2 < i_2$. Hence, we are able to obtain an elegant solution to the
pattern search problem.

A naive attempt is to consider all combinations of positions i in the first RNA
and positions j in the second RNA and to extend these starting patterns by
looking at neighbouring nucleotides sharing the same sequential and structural
properties. If these properties are fulfilled, then the nucleotides are taken
into the pattern. At first glance, this idea may work, but the crucial point is
the loops. Consider e.g. the case shown in Figure 2. Suppose the algorithm
starts at position 1 in the first RNA and position 1 in the second RNA and is
working towards the multiple loop in the first RNA. The lower stem has been
successfully matched. But now there is no clear decision whether to match the
upper part of the stem-loop of the second RNA to the left side or to the right
side of the multiple loop. This decision depends, of course, on how a common
pattern is defined and on how to reach a maximally extended pattern. Therefore,
the only solution is to make some pre-computations of sequential and structural
components of the RNAs. Finally, we end up with a dynamic programming approach
which compares inner parts of the RNAs first, stores the results in different
matrices and builds up the solutions successively. Note that it is also a
mistake to compute common sequential parts first and then to recompose these
parts by their structural properties. This is obviously computationally
intractable, because all combinations of subsets of sequence parts would have
to be considered.

Fig. 2. Alternative matchings.

Related Work: Wang et al. [13] published an algorithm for finding a largest
approximately common substructure between two trees. This is an inexact pattern
matching algorithm suitable for RNA secondary structures. A survey of computing
similarity between RNAs with and without secondary structures until 1995 is
given by Bafna et al. [2]. Gramm et al. [5] formulated the arc-preserving
problem: given two nested RNAs $S_1$ and $S_2$ with lengths n and m (n > m),
respectively, does $S_2$ occur in $S_1$ such that $S_2$ can be obtained by
deleting bases from $S_1$ with the property that the arcs are preserved? This
problem is not biologically motivated, because the structure of $S_2$ would be
found split up in $S_1$. It has been shown by Jiang et al. [8] that finding the
longest common arc-preserving subsequence for arc-annotated sequences (LAPCS),
where at least one of them has a crossing arc structure, is MAXSNP-hard. Exact
pattern matching on RNAs has been done by Gendron et al. [4]. They propose a
backtracking algorithm, similar to an algorithm by Ullman [11] that solves the
subgraph isomorphism problem from graph theory. It aims at finding recurrent
patterns in one RNA.

The paper is organized as follows: In Section 2, we introduce definitions and
notations for RNAs. In Section 3, we define matchings between two RNAs such
that they can be described by matching and matched paths. In Section 4, a
bond-preserving matching is proposed, which is used for the dynamic programming
matrices (Section 5). The matrices are computed by recursion equations in
Section 6. The pseudo-code is given in Section 7.



2 Definitions and Notations 



An RNA is a tuple $(S, P)$, where $S$ is a string of length $n$ over the
alphabet $\Sigma = \{A, C, G, U\}$. We denote by $S(i)$ the base at position
$i$. $P$ is a set of base-pairs $(i, i')$ with $1 \le i < i' \le n$, such that
$S(i)$ and $S(i')$ are complementary bases. Here, we refer to the Watson-Crick
base-pairs A-U and C-G, as well as the non-standard base-pair G-U. In the
following, we write $i \overset{P}{\frown} i'$ instead of $(i, i') \in P$,
meaning that the two bases $S(i)$ and $S(i')$ are linked together by a bond.
For the rest of the paper, we restrict our set of base-pairs to secondary
structures holding the following property: for any two base-pairs $(i, i')$ and
$(j, j')$, either $i < i' < j < j'$ (independent) or $i < j < j' < i'$ (nested).
The nestedness condition allows us to partially order the bases of an RNA.
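
The restriction to secondary structures can be checked mechanically. The
following is a minimal Python sketch, not part of the original method; the
function name is_secondary_structure and the representation of P as a list of
1-based index pairs are assumptions made for illustration only.

```python
from itertools import combinations


def is_secondary_structure(pairs):
    """Return True if every two base-pairs are independent or nested.

    `pairs` is an assumed representation of P: an iterable of 1-based
    position pairs (i, i').
    """
    # Normalise each pair so that the left end comes first.
    norm = sorted(tuple(sorted(p)) for p in pairs)
    for (i, i2), (j, j2) in combinations(norm, 2):
        # Because `norm` is sorted, i <= j holds here.
        independent = i < i2 < j < j2
        nested = i < j < j2 < i2
        if not (independent or nested):
            return False  # crossing bonds or a base shared by two bonds
    return True
```

The check is quadratic in the number of base-pairs, which is sufficient for a
sanity test on input data.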

Definition 1 (Stacking Order). Let $(S, P)$ be an RNA. The stacking order
of a base $S(i)$ (abbreviated as $stord_P(i)$) is the number of bonds
$k \overset{P}{\frown} l$ with $k < i < l$, plus one.
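
As an illustration of Definition 1, the stacking order can be computed directly
from the set of base-pairs. This is only a sketch: the function name
stacking_order and the list-of-pairs representation of P are assumptions, and
the linear scan per position is chosen for clarity rather than speed.

```python
def stacking_order(i, pairs):
    """Stacking order of position i: number of bonds (k, l) with k < i < l, plus one.

    `pairs` is an assumed representation of P as (left, right) pairs, 1-based.
    """
    return 1 + sum(1 for k, l in pairs if k < i < l)


# Toy structure (hypothetical): bond (1, 10) encloses the bonds (2, 5) and (6, 9).
example_pairs = [(1, 10), (2, 5), (6, 9)]
orders = [stacking_order(i, example_pairs) for i in range(1, 11)]
```

In this toy structure, the positions 2, 5, 6 and 9 all receive stacking order 2,
i.e. they lie directly below the outermost bond.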



Hence, we are able to partition a secondary structure into structure elements
with the same stacking order. We call them loops. See e.g. Figure 1 for various
loop names. For our algorithmic approach, we have to look at neighbouring bases
belonging to the same loop. This is achieved by a function right (left) of an
RNA $(S, P)$:

$$right_P(i) = \begin{cases} j & \text{if } (i, j) \in P \\ i + 1 & \text{otherwise} \end{cases}$$

and analogously for $left_P(i)$. The function $right_P^k(i)$ (resp.
$left_P^k(i)$) is shorthand for applying the right function (resp. left) to $i$
$k$ times. We define $rbd_P(i)$ (resp. $lbd_P(i)$) to be true if there is a bond
$i \overset{P}{\frown} i'$ (resp. $i' \overset{P}{\frown} i$), and false
otherwise. Thus, we can describe loops mathematically as follows.

Definition 2 (Loop). Let $(S, P)$ be an RNA. The loop which is enclosed by a
bond $i \overset{P}{\frown} i'$ is the set of positions

$$loop(i \overset{P}{\frown} i') = \{ r \mid i < r < i' \wedge \exists k : r = right_P^k(i) \}.$$
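
The loop of Definition 2 can be enumerated by repeatedly applying the right
function. The sketch below is an illustration under stated assumptions: the
names right and loop_positions are hypothetical, P is given as a list of
(left, right) pairs, and the walk is started at i + 1, because applying right
to the left end i of the closing bond would jump directly to i'.

```python
def right(i, pairs):
    """right_P(i): the bond partner j if (i, j) is in P, otherwise i + 1."""
    partner = dict(pairs)  # maps the left end of each bond to its right end
    return partner.get(i, i + 1)


def loop_positions(i, i2, pairs):
    """Positions strictly between i and i2 that lie directly below the bond (i, i2).

    Assumption: the walk starts at i + 1 rather than at i itself.
    """
    positions = []
    r = i + 1
    while r < i2:
        positions.append(r)
        r = right(r, pairs)  # skips over the interior of enclosed bonds
    return positions
```

For the toy structure with the bonds (1, 10), (2, 5) and (6, 9), the loop
enclosed by (1, 10) consists of the ends of the two inner bonds, while the
interior positions of those bonds belong to deeper loops.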



3 Matchings 

Suppose we are given two RNAs $(S_1, P_1)$ and $(S_2, P_2)$. The sets
$V_1 = \{ i \mid 1 \le i \le |S_1| \}$ and $V_2 = \{ j \mid 1 \le j \le |S_2| \}$
contain the positions of both RNAs.

Definition 3 (Matching). A matching M between two RNAs $(S_1, P_1)$ and
$(S_2, P_2)$ is a set of pairs $M \subseteq \{ (i, j) \mid i \in V_1 \wedge j \in V_2 \}$
which describes a partial bijection from $V_1$ to $V_2$ and satisfies the
following conditions:

1. structure condition: for each $(i, j) \in M$, it follows that
$rbd_{P_1}(i) \Leftrightarrow rbd_{P_2}(j)$ and
$lbd_{P_1}(i) \Leftrightarrow lbd_{P_2}(j)$

2. base condition: for each $(i, j) \in M$, it follows that $S_1(i) = S_2(j)$

The matching definition is applied to single bases. Since bases are sometimes
part of base-pairs, we may see them as units, given by an additional bond
condition:

3. bond condition: for each $\{ (i, j), (i', j') \} \subseteq M$ with
$i \overset{P_1}{\frown} i'$ and $j \overset{P_2}{\frown} j'$, it follows that
$S_1(i) = S_2(j) \wedge S_1(i') = S_2(j')$

The range of the first RNA is given as the set
$ran_1(M) = \{ i \mid \exists j : (i, j) \in M \}$. It describes the pattern
found in the first RNA which is matched to the same pattern in the second RNA.
Given an element $i \in ran_1(M)$, we denote by $M(i)$ the uniquely determined
element $j$ with $(i, j) \in M$. Similarly, given an element $j \in ran_2(M)$,
we denote by $M^{-1}(j)$ the uniquely determined element $i$ with
$(i, j) \in M$.

The first two points of the definition can easily be written as a matching
predicate between two bases at positions i and j:

$$match(i, j) = [S_1(i) = S_2(j)] \wedge [lbd_{P_1}(i) \Leftrightarrow lbd_{P_2}(j)] \wedge [rbd_{P_1}(i) \Leftrightarrow rbd_{P_2}(j)]$$

The bond condition provides a structure-conserving requirement based on
base-pairs. It can be extended by a bond check such that the predicate
$match(i \frown i', j \frown j')$ is given as

$$[i \overset{P_1}{\frown} i'] \wedge [j \overset{P_2}{\frown} j'] \wedge [S_1(i) = S_2(j)] \wedge [S_1(i') = S_2(j')]$$
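
The two predicates translate almost literally into code. The following Python
sketch is meant only as an illustration; the function names and the
representation of an RNA as a tuple (sequence string, list of (left, right)
pairs with 1-based positions) are assumptions, not the implementation used in
this paper.

```python
def rbd(i, pairs):
    """True if position i has a bond partner to its right (i is a left end)."""
    return any(k == i for k, _ in pairs)


def lbd(i, pairs):
    """True if position i has a bond partner to its left (i is a right end)."""
    return any(l == i for _, l in pairs)


def match(i, j, rna1, rna2):
    """Base and structure condition for positions i (first RNA) and j (second RNA)."""
    (s1, p1), (s2, p2) = rna1, rna2
    return (s1[i - 1] == s2[j - 1]
            and lbd(i, p1) == lbd(j, p2)
            and rbd(i, p1) == rbd(j, p2))


def bond_match(i, i2, j, j2, rna1, rna2):
    """Bond predicate: both bonds exist and both pairs of end bases agree."""
    (s1, p1), (s2, p2) = rna1, rna2
    return ((i, i2) in p1 and (j, j2) in p2
            and s1[i - 1] == s2[j - 1]
            and s1[i2 - 1] == s2[j2 - 1])
```

Note that match compares the bond flags of the two positions, so an unpaired G
never matches a paired G even though the bases agree.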

The matching conditions are applied to single bases or base-pairs so far.
Now, we want to merge bases and base-pairs such that special relations among
them are fulfilled. They provide a definition for matchings. We make use of a
transition type function on two positions $i$ and $i'$, which is $+1$, $-1$ or
$0$ depending on whether $i = i' + 1$, $i = i' - 1$ or
$i \overset{P}{\frown} i'$. A path in an RNA is a sequence of positions
$i_1 \ldots i_k$ such that the bases $S(i_l)$ and $S(i_{l+1})$ for
$l = 1, \ldots, k - 1$ are connected due to the bond conditions or due to the
backbone of this RNA.

Definition 4 (Matching/Matched Path). Let $(S_1, P_1)$ and $(S_2, P_2)$ be two
RNAs and M a matching between them. An M-matching path is a list of pairs
$(i_1, j_1) \ldots (i_k, j_k) \in M$ such that

1. $i_1 \ldots i_k$ is a path in $(S_1, P_1)$

2. $j_1 \ldots j_k$ is a path in $(S_2, P_2)$

3. for each $1 \le l < k$ the transition types of $(i_l, i_{l+1})$ and
$(j_l, j_{l+1})$ are equal.

A matching is connected if there is an M-matching path between any two pairs
in M. A path in only one of the RNAs consisting of only matched bases is called
an M-matched path.
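
The three conditions of Definition 4 can be tested on a candidate list of
pairs. The sketch below is illustrative only: the names transition_type and
is_matching_path are hypothetical, bonds are looked up in either direction
(an assumption, since the definition does not fix a direction for the
transition type 0), and membership of the pairs in M is not checked.

```python
def transition_type(i, i2, pairs):
    """+1 if i = i2 + 1, -1 if i = i2 - 1, 0 if i and i2 are bonded, else None."""
    if i == i2 + 1:
        return 1
    if i == i2 - 1:
        return -1
    if (i, i2) in pairs or (i2, i) in pairs:
        return 0
    return None  # i and i2 are not connected at all


def is_matching_path(path, pairs1, pairs2):
    """Check that consecutive pairs of `path` are connected with equal transition types."""
    for (i, j), (i2, j2) in zip(path, path[1:]):
        t1 = transition_type(i, i2, pairs1)
        t2 = transition_type(j, j2, pairs2)
        if t1 is None or t1 != t2:
            return False
    return True
```

A path that uses a bond in the first RNA is therefore rejected whenever the
corresponding positions of the second RNA are not bonded as well.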












Fig. 3. Unpreserved bonds (backbone and secondary). a) The backbone bond
$(i - 1, i)$ is not preserved. b) The bond $i \overset{P}{\frown} i'$ is not
preserved. The matching is indicated by blue and green nodes. In both cases,
the corresponding bases in the second structure are connected with nodes (in
red) that are not part of the matching.

Note the difference between matching paths and matched paths. A matched
path is a path occurring in one structure, but there need not be a
corresponding path in the other structure. Furthermore, the restriction of a
matching path to one structure clearly produces a matched path, but the
converse is not true: there are matched paths where the image of the path
(under the matching) is not a path in the other structure. To clarify this,
consider the simplest matched paths, which are edges (backbone connections or
bonds) between matched bases. By definition, they are matched paths, but there
might not be a matching path associated with them. This happens for bases which
mark the "ends" of the matching. The two cases for backbone edges and bond
edges are shown in Figure 3.

From the definition of matchings, it is not clear whether they respect the
backbone order, i.e. whether $i < i'$ implies $M(i) < M(i')$. One can show that
this holds for connected matchings. Since we will later restrict ourselves to
matchings that preserve bonds, and the proofs are simpler for this kind of
matching, we omit the proof for the general case here. For general connected
matchings, we treat only the simple case of preserving the stacking order.
Proposition 1. Let $(i_1, j_1) \ldots (i_k, j_k) \in M$ be a matching path. Then
the path preserves the relative stacking order, i.e. for all $1 \le r \le k$ we
have $stord_{P_1}(i_1) - stord_{P_2}(j_1) = stord_{P_1}(i_r) - stord_{P_2}(j_r)$.

4 Bond Preserving Matching 

As Figure 3b indicates, a matched bond $i \overset{P}{\frown} i'$ which does
not correspond to a matching path only occurs if we have a stem in the first
structure that is matched to a multiple loop in the second structure (or vice
versa). This is biologically unwanted, since it is very unlikely that this
pattern could have been generated by evolution. For that reason, we are
interested in matchings that preserve bonds.

Definition 5 (Bond-Preserving Matching). A connected matching M is said to be
bond-preserving if every matched bond in $P_1$ or $P_2$ is also a matching
path, i.e. if $\{ (i, j), (i', j') \} \subseteq M$ and
$i \overset{P_1}{\frown} i'$, then $j \overset{P_2}{\frown} j'$, and vice versa.
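
The bond clause of Definition 5 is a pairwise condition and can be verified by
brute force. The following Python sketch illustrates this under assumptions:
the name is_bond_preserving is hypothetical, a matching is given as a list of
pairs (i, j), and connectedness of the matching is not checked here.

```python
from itertools import combinations


def is_bond_preserving(matching, pairs1, pairs2):
    """Check that two matched positions are bonded in one RNA iff they are in the other.

    Only the bond clause of the definition is tested; the matching is assumed
    to be connected already.
    """
    bonds1 = {tuple(sorted(p)) for p in pairs1}
    bonds2 = {tuple(sorted(p)) for p in pairs2}
    for (i, j), (i2, j2) in combinations(matching, 2):
        bonded1 = tuple(sorted((i, i2))) in bonds1
        bonded2 = tuple(sorted((j, j2))) in bonds2
        if bonded1 != bonded2:
            return False  # a matched bond has no counterpart in the other RNA
    return True
```

The quadratic scan over all pairs of the matching is acceptable for checking
results, but it is not how the dynamic programming method enforces the
property.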

In the following, we will consider bond-preserving matchings. We say that
a connected, bond-preserving matching M is maximally extended if there is
no connected, bond-preserving matching $M'$ such that $M \subsetneq M'$. We are
interested in finding all (non-overlapping) maximally extended matchings. For
this purpose, we need to show some properties. We start with a proposition that
allows us to decompose the problem of finding a maximally extended matching
into subproblems of finding maximally extended loop matchings. The next
proposition shows that the backbone order is respected. The third proposition
shows that if we do not exceed a loop, then maximally extended matchings (in
this loop) are uniquely determined by one element.

Proposition 2. Let $i, i' \in loop(r \frown s)$, and let M be a bond-preserving
matching with $\{ (i, j), (i', j') \} \subseteq M$. Then any shortest matching
path between $(i, j)$ and $(i', j')$ uses only elements of
$loop(r \frown s) \cup \{ r, s \}$.

Proof (Sketch). By contradiction. If there were a path not satisfying this,
then this path would have to use a bond twice, either the same element or both
elements of the bond. In the first case, we have an immediate contradiction to
the minimality. In the second case, the bond is a matched bond. Since M is
bond-preserving, one gets a shorter matching path by using this bond.

Proposition 3 (Backbone Order). Let M be a connected matching, and
$(i, j), (i', j') \in M$. Then $i < i'$ if and only if $j < j'$.

The proof is given in the appendix. 

Proposition 4. Let $i \frown i'$ and $j \frown j'$ be two bonds, and let
$r \in loop(i \frown i')$ and $s \in loop(j \frown j')$. Let $M, M'$ with
$i, i' \notin ran_1(M) \cup ran_1(M')$ and

Proof (Sketch). Follows from Proposition 2 and the fact that if one does not
use the closing bond of a loop, then there is only one unique connecting path
between two elements of the loop. Hence, there cannot be any conflicting paths
that are matched to different elements, and the two matchings must agree on
the loop elements.



5 Dynamic Programming Matrices 



We want to find all non-overlapping, maximally extended, bond-preserving 
matchings. For overlapping matchings, we choose the one with maximal size. 
If there are overlapping matchings of the same size, then only one is selected. 

We use a dynamic programming approach by filling a matrix $M(r, s)$ with
the following interpretation. We define an order $\preceq$ on elements as
follows (we write $\prec$ for its strict version):

$$i \preceq j \iff \begin{cases} i \le j & \text{if } stord_P(i) = stord_P(j) \\ stord_P(i) < stord_P(j) & \text{otherwise} \end{cases}$$

For pairs $(r, s)$ and $(k, l)$ we define $(r, s) \preceq (k, l)$ if and only if
$r \preceq k$. Then

$$M(r, s) = \max \left\{ |M| \;\middle|\; \begin{array}{l} M \text{ is a maximally extended matching} \\ \text{with } (r, s) \in M \text{ and there is no} \\ (r', s') \in M \text{ with } (r', s') \prec (r, s) \end{array} \right\}$$

contains the size of a maximal matching. For simplicity, we assume the maximum
value over an empty set to be 0. Note that the size is stored only for the
left-most, bottom-most pair $(r, s)$ in M. For calculating $M(r, s)$, we will
additionally need auxiliary matrices, which are defined as follows.

Definition 6 (Auxiliary Matrices). Let $R_1 = (S_1, P_1)$ and
$R_2 = (S_2, P_2)$ be two RNAs. Let r (resp. s) be an element of
$loop(i \frown i')$ (resp. $loop(j \frown j')$). Then $M^{loop}(r, s)$ is the
size of the maximal matching within the loops that contain $(r, s)$, and is
extended to the right or above $(r, s)$, i.e.

$$M^{loop}(r, s) = \max \left\{ |M| \;\middle|\; \begin{array}{l} M \subseteq [i..i'] \times [j..j'] \text{ is a connected} \\ \text{matching with } (r, s) \in M \text{ and} \\ \forall (r', s') \in M \setminus \{ (i, i'), (j, j') \} : (r, s) \preceq (r', s') \end{array} \right\}$$



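In addition, we define for every i, j such that $i \overset{P_1}{\frown} i'$
and $j \overset{P_2}{\frown} j'$ the matrix element
$M^{bb}(i \frown i', j \frown j')$ to be the maximal matching that matches the
bonds $i \frown i'$ and $j \frown j'$, i.e.

$$M^{bb}(i \frown i', j \frown j') = \max \left\{ |M| \;\middle|\; \begin{array}{l} M \subseteq [i..i'] \times [j..j'] \text{ is a} \\ \text{connected matching with} \\ (i, j) \in M \text{ and } (i', j') \in M \end{array} \right\}$$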
In addition, we define $M^{rb}(i \frown i', j \frown j')$ to be the maximal
matching containing the right partners $i'$ and $j'$ of the bonds only, i.e.

$$M^{rb}(i \frown i', j \frown j') = \max \left\{ |M| \;\middle|\; \begin{array}{l} M \subseteq [i + 1..i'] \times [j + 1..j'] \\ \text{is a connected matching} \\ \text{with } (i', j') \in M \end{array} \right\}$$



The first procedure calculates $M^{loop}(r, s)$ for a matching of two loops
associated with the bonds $i \frown i'$ and $j \frown j'$, given that the
bond matrices are already calculated for all bonds that are contained in the
two loops. For calculating $M^{bb}(i \frown i', j \frown j')$, we use
additional auxiliary variables. The variable RDist stores the loop distance to
the right end of the loop. Thus, for given RDist, we consider elements r and s
which have distance RDist to $i'$ and $j'$, respectively. Looking from the
right end $(i', j')$ of the loop, this implies that

$$r = left_{P_1}^{RDist}(i') \quad \text{and} \quad s = left_{P_2}^{RDist}(j').$$

First, we need to know whether there is a matching connecting $(r, s)$ with the
right ends of the loop $(i', j')$:

$$Reach^{r\_end}(RDist) = \begin{cases} true & \text{if } \exists \text{ connected matching } M \subseteq [i..i'] \times [j..j'] \text{ with } (r, s) \in M \text{ and } (i', j') \in M \\ false & \text{otherwise} \end{cases} \quad (1)$$



Since we do not need the matrix entries any further, we only store the current
value in the variable Reach. In addition, we store the size of the matching
that is used in the definition of $Reach^{r\_end}(RDist)$. If
$Reach^{r\_end}(RDist)$ is false, then we use the size of the last entry
$Reach^{r\_end}(RDist')$ with $RDist' < RDist$ and
$Reach^{r\_end}(RDist') = true$. Technically, this is achieved by an array
$M^{r\_end}(RDist)$ with

$$M^{r\_end}(RDist) = \max \left\{ |M| \;\middle|\; \begin{array}{l} M \subseteq [i..i'] \times [j..j'] \text{ is a connected} \\ \text{matching with } (i', j') \in M \text{ and} \\ \forall (r', s') \in M \setminus \{ (i, i'), (j, j') \} : (r, s) \preceq (r', s') \end{array} \right\} \quad (2)$$



6 Recursion Equations 

The auxiliary matrices and arrays can be easily calculated via the following
recursion equations. For $M^{loop}(r, s)$ we have

M(f°P(r,s)= (3) 

M’’^(r, s) + M(f°^(r' -|- 1, s' -|- 1) if match(r r', s s') 

^ M''^(r, s) + M^.f°^(r -|- 1, s -I- 1) else if rbdp^ (r) A rbdp^ (s) A match(r, s) 

1 3- M^ff°^(r -I- 1, s -I- 1) else if match(r, s) 

0 otherwise 
Note that if r and s are the left ends of the bonds $r \frown r'$ and
$s \frown s'$, but the bonds are not matchable, then this case is covered by
the third case. Here, $r + 1$ and $s + 1$ are not in the same loop as r and s.
Therefore, we consider the case where the maximal matching extends to the next
loop via the left ends of two bonds. This case is depicted in Figure 4: r and s
do match, whereas the bonding partners $r'$ and $s'$ do not match. The
currently considered loop is encircled. Since $r + 1$ and $s + 1$ in the
contiguous loop do match, we know that we can calculate $M^{loop}(r, s)$
recursively by calculating $M^{loop}(r + 1, s + 1)$.




Fig. 4. Extension to next loop. 



The next step is to define the auxiliary arrays $Reach^{r\_end}(RDist)$ and
$M^{r\_end}(RDist)$ for a given loop. RDist is the distance to the right end of
the closing bond. Consider the case where we want to match two loops associated
with the bonds $i \overset{P_1}{\frown} i'$ and $j \overset{P_2}{\frown} j'$.
Let $len_{min}$ be the minimum of the two loop lengths, and
$0 \le RDist \le len_{min}$. Then

$$Reach^{r\_end}(0) = \begin{cases} true & \text{if } match(i', j') \\ false & \text{otherwise} \end{cases}$$

and

$$M^{r\_end}(0) = \begin{cases} 1 & \text{if } match(i', j') \\ 0 & \text{otherwise} \end{cases}$$



For $1 \le RDist \le len_{min}$, let $r = left_{P_1}^{RDist}(i')$ and
$s = left_{P_2}^{RDist}(j')$ be the two positions with distance RDist to the
right end of the considered loops. Then we obtain

$$Reach^{r\_end}(RDist) = Reach^{r\_end}(RDist - 1) \wedge match(r, s)$$

$$M^{r\_end}(RDist) = \begin{cases} M^{loop}(r, s) & \text{if } Reach^{r\_end}(RDist) \\ M^{r\_end}(RDist - 1) & \text{otherwise.} \end{cases}$$



The matrix $M^{rb}(i \frown i', j \frown j')$ can be easily defined by

$$M^{rb}(i \frown i', j \frown j') = \max_{0 \le RDist \le len_{min}} \left\{ M^{r\_end}(RDist) \right\}$$

For the $M^{bb}$ matrix, there are two different cases, as shown in Figure 5.
In the first case a), the extension from the initial matching of the left ends
to the right and the extension from the right ends to the left do not overlap,
whereas they do overlap in the second case b). For the second case, we do not
know exactly how to match the overlapping part. Hence, we have to consider all
possible cuts in the smaller loop, marking the corresponding ends of the
extensions from the left ends and from the right ends of the loop. The
extensions from the right ends have already been calculated. Only for the
definition of the recursion equation, we define $M^{l\_end}(LDist)$ and
$Reach^{l\_end}(LDist)$ analogously to equations (2) and (1), respectively. For
the implementation, we need to store only the current values $M^{l\_end}$ and
$Reach^{l\_end}$.

Fig. 5. The two possible cases for $M^{bb}(i \frown i', j \frown j')$.

Now let $len_{i,i'}$ (resp. $len_{j,j'}$) be $|loop(i \frown i')|$ (resp.
$|loop(j \frown j')|$), and let $len_{min} = \min\{ len_{i,i'}, len_{j,j'} \}$.
Then we have

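$$M^{bb}(i \frown i', j \frown j') = \max_{\substack{0 \le LDist \le len_{min} \text{ with } right_{P_1}^{LDist}(i) \\ \text{is not a left end of a bond}}} \left\{ M^{l\_end}(LDist) + M^{r\_end}(RDist) \right\} \quad (4)$$

where

$$RDist = \begin{cases} len_{i,i'} - LDist & \text{if } len_{i,i'} = len_{min} \\ len_{i,i'} + (len_{i,i'} - len_{j,j'}) - LDist & \text{else} \end{cases}$$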


The condition that $right_{P_1}^{LDist}(i)$ is not a left end of a bond
guarantees that we do not cut in the middle of a bond, which is excluded since
we are considering bond-preserving matchings only. The term
$(len_{i,i'} - len_{j,j'})$ in the second part of the definition of RDist
compensates for the longer length of the first loop (see the footnote below).

Finally, we consider the $M(r, s)$ entries. Let r and s again be two bases of
the loops defined by $i \frown i'$ and $j \frown j'$ with distance RDist to the
right loop ends $i'$ and $j'$, respectively. The values of $M(r, s)$ and
$M^{loop}(r, s)$ are equal for all entries $M(r, s) \ne 0$. $M(r, s)$ is zero
if there is some $(r', s') \prec (r, s)$ that is matchable. This leads to the
following equation:

Footnote: In the case that $i \frown i'$ is the smaller loop, overlapping of
the left and right match extensions is already excluded by definition, and we
do not need to compensate for it.
 1: procedure Start-Loop-Walking(i, i', j, j')
       ▷ Right loop ends of both RNAs
 2:    reach := Init-Loop-Matrices(i', j', i', j')
 3:    (loop_size, loop_dist) := Loop-Walking(i', j', i, j, i', j', reach, true)
       ▷ Only right loop end of first RNA
 4:    k := i'
 5:    while k > i + 1 do
 6:       k := left_{R_1}(k)
 7:       Init-Loop-Matrices(k, j', i', j')
 8:       Loop-Walking(k, j', i, j, i', j', false, false)
 9:    end while
       ▷ Only right loop end of second RNA
10:    l := j'
11:    while l > j + 1 do
12:       l := left_{R_2}(l)
13:       Init-Loop-Matrices(i', l, i', j')
14:       Loop-Walking(i', l, i, j, i', j', false, false)
15:    end while
16:    return (loop_size, loop_dist)
17: end procedure



Fig. 6. Starting points of loop walking 



{ 0 if ^match(r, s) V match(leftfl^ (r), (s)) 

yReachJ'-^^'^(RDist) 

M^°°^(r, s) otherwise 

7 Pseudo-code 

The main procedure consists of two for-loops, iterating over the base-pairs of
the first and the second RNA, respectively, and performs the pattern search
from inner to outer loops. It calls the procedure Start-Loop-Walking, which
initiates the calculation of all matrices except
$M^{bb}(i \frown i', j \frown j')$ for two bonds $i \frown i'$ and
$j \frown j'$, assuming that all matrix entries for loops above are already
calculated. In addition, it calculates the loop length of the smaller loop and
the distance of the two loop lengths (which is done in sub-procedure
Calc-Remain-Loop-Len).

The real calculation of these matrices is done in the sub-procedure
Loop-Walking, which traverses the loop from right to left (via application of
the left function). The function Loop-Walking has two modes, depending on
whether we started the loop traversal with both right ends $i'$, $j'$ or not.
In the first mode (initiated in line 3 of Start-Loop-Walking), we also
calculate the array $M^{r\_end}$, and move the $M(r, s)$ down to $(i', j')$ for
all $(r, s)$ where $Reach^{r\_end}$ is true. This part is done by the
sub-procedure Loop-Reach. In the second mode, when Loop-Walking is called with
only one right end (lines 8 and 14 of Start-Loop-Walking), we know that the
right ends cannot be in any matching considered there. Hence, we need not
calculate the array $M^{r\_end}$.

The sub-procedure Mloop-Recursion is just an implementation of recursion
Equation (3) for $M^{loop}$. The sub-procedure Init-Loop-Matrices just
initializes the matrices for the starting points. In most cases, the initial
values are 0 (since we cannot have a match if we do not start with the right
ends, due to the structure condition). The only exception is if we start with
both right ends, and these right ends do match. In this case, we initialize the
corresponding matrix entries with 1. The sub-procedure Init-Loop-Matrices is
listed in the appendix.

 1: procedure Loop-Walking(r, s, i, j, i', j', reach, right_ends)
 2:    RDist := 0
 3:    while r > i ∧ s > j do
 4:       r' := r; s' := s
 5:       r := left_{R_1}(r'); s := left_{R_2}(s'); RDist := RDist + 1
 6:       if Base-Match(r, s) ∨ Bond-Match(r', r, s', s) then
 7:          Mloop-Recursion(r', r, s', s)
 8:          M(r, s) := M^{loop}(r, s); M(r', s') := 0
 9:          if right_ends then
10:             Loop-Reach(r, s, i, j, i', j', reach, RDist)
11:          end if
12:       else
13:          M^{loop}(r, s) := 0; M(r, s) := 0; reach := false
14:          if right_ends then
15:             M^{r_end}(RDist) := M^{r_end}(RDist - 1)
16:          end if
17:       end if
18:    end while
19:    if right_ends then
20:       return Calc-Remain-Loop-Len(r, s, i, j, RDist)
21:    end if
22: end procedure

Fig. 7. The procedure Loop-Walking goes from one base to the next.

The next step is to calculate $M^{bb}(i \frown i', j \frown j')$, which is done
by the procedure Loop-Matching. Loop-Matching is called after
Start-Loop-Walking is finished. In principle, this is just an implementation of
the recursion equation (4). Since we do not want to maintain another array
$M^{l\_end}(LDist)$, we store only the value for the current LDist in the
variable $M^{l\_end}$. The procedure maintains three neighbouring cells
$(r^l, s^l)$, $(r, s)$ and $(r^r, s^r)$. $(r^l, s^l)$ corresponds to
$LDist - 1$, and $(r, s)$ to LDist. The cut will be between $(r, s)$ and
$(r^r, s^r)$. The sub-procedure Mlend-Recursion is in principle only an
implementation of the recursion equation for $M^{l\_end}(LDist)$ under the
condition that $Reach^{l\_end}(LDist)$ is true. As can be seen from the
definition of $M^{r\_end}$ in Equation (2), the recursion equation under this
condition is in principle analogous to the recursion equation for $M^{loop}$
given in Equation (3).
procedure Loop-Matching(i, i', j, j', i_i'_len, lens_dist)
   LDist := 0
   if Bond-Match(i, i', j, j') then
      Reach^{l_end} := true
      r^r := i; r := i; r^l := i
      s^r := j; s := j; s^l := j
      while r^r < i' ∧ s^r < j' ∧ Reach^{l_end} = true do
         r^l := r; r := r^r; r^r := right_{R_1}(r^r)
         s^l := s; s := s^r; s^r := right_{R_2}(s^r)
         if Base-Match(r, s) ∨ Bond-Match(r^r, r, s^r, s) then
            M^{l_end} := Mlend-Recursion(r^l, r, r^r, s^l, s, s^r, M^{l_end})
         else
            Reach^{l_end} := false
         end if
         if Reach^{l_end} ∧ ¬Bond-Match(r^l, r, s^l, s) then
            Fill-Mbb(i, i', j, j', LDist, i_i'_len, lens_dist)
         end if
         LDist := LDist + 1
      end while
   else
      M^{bb}(i ⌢ i', j ⌢ j') := 0
   end if
end procedure



Fig. 8. Calculation of $M^{bb}(i \frown i', j \frown j')$.



The maximally extended matchings are finally calculated from the $M(r, s)$
matrix by a standard traceback. The space complexity of the algorithm is
$O(nm)$. The time complexity is $O(nm)$ for the following reason. Every pair
$(r, s)$ with $1 \le r \le |S_1|$ and $1 \le s \le |S_2|$ is considered at most
twice in Start-Loop-Walking and Loop-Walking, with an $O(1)$ complexity for
calculating the corresponding matrix entries. Similarly, every pair $(r, s)$ is
considered at most twice in Loop-Matching. Since there are $O(nm)$ pairs
$(r, s)$, we get a total complexity of $O(nm)$.



8 Conclusion 

We have presented a fast dynamic programming approach, running in time $O(nm)$
and space $O(nm)$, for detecting common sequence/structure patterns between
two RNAs given by their primary and secondary structures. These patterns
are derived from exact matchings and can be used for local alignments ([1]).
The most promising advantage is clearly the ability to investigate large RNAs
of several thousand bases in reasonable time. Here, one can think of detecting
local sequence/structure regions of several RNAs sharing the same biological
function.
Abstract. The paper deals with the problem of finding a tandem scattered
subsequence of maximum length (LTS) for a given character sequence. A sequence
is referred to as tandem if it can be split into two identical sequences. An
efficient algorithm for the LTS problem is presented and is shown to have
$O(n^2)$ computational complexity and linear memory complexity with respect to
the length n of the analysed sequence. A conjecture is put forward and
discussed, stating that the complexity of the given algorithm may not be easily
improved. Finally, the potential application of the solution to the LTS problem
in approximate tandem substring matching in DNA sequences is discussed.



1 Introduction 

A perfect single repeat tandem sequence (referred to throughout this article
simply as a tandem sequence) is one which can be expressed as the concatenation
of two identical sequences. Tandem sequences are well studied in the
literature. The problem of finding the longest tandem substring (the longest
subsequence composed of consecutive elements) of a given sequence was solved by
Main and Lorentz [7], who showed an $O(n \log n)$ algorithm, later improved to
$O(n)$ complexity by Kolpakov and Kucherov [3]. A lot of attention has also
been given to finding longest approximate tandem substrings of sequences, where
the approximation criterion of the match is given either in terms of the
Hamming distance or the so-called edit distance between the substring and the
sequence.

This paper deals with a related problem, namely determining the longest
tandem subsequence (which need not be a substring) of a given sequence (the
so-called LTS problem). A formal definition of LTS is given in Subsection 2.1,
and an efficient algorithm which solves LTS in $O(n^2)$ time using $O(n)$ space
is outlined in Subsections 2.1 and 2.2. This result is a major improvement on
the hitherto extensively used naive algorithm, which reduces the solution of
LTS to n iterations of an algorithm solving the longest common subsequence
problem (LCS), yielding $O(n^3)$ computational complexity.

Finally, in Section 3 we consider the application of LTS as a relatively fast
(but not always accurate) criterion for finding approximate tandem substrings
of sequences and judging how well they match the original sequence.
2 An Efficient Algorithmic Approach to the LTS Problem

2.1 Notation and Problem Definition

Throughout the paper, the set of nonnegative integers is denoted by
$\mathbb{N}$. Sets of consecutive elements of $\mathbb{N}$ are referred to as
discrete intervals and denoted by the symbol $\langle i, j \rangle$, which is
equivalent to $\{ i, i + 1, \ldots, j \}$.

Definition 1. A character sequence s of length n over a nonempty alphabet
$\Omega$ is a function $s : \langle 1, n \rangle \to \Omega$. The length $|s|$
of sequence s is the number of elements of the sequence, $|s| = n$. The symbol
$s_i$, where $1 \le i \le n$, is used to denote $s(i)$, the i-th element of
sequence s.

Sequence s is expressed in compact form as $s = [s_1 s_2 \ldots s_n]$.

Definition 2. Sequence s of length $|s| = n$ is called a tandem sequence if n
is an even number and $\forall_{1 \le i \le n/2} \; s_i = s_{i + n/2}$.

Definition 3. Sequence t, $|t| = k$, is called a subsequence of sequence s,
$|s| = n$, if it is possible to indicate an increasing function
$h : \langle 1, k \rangle \to \langle 1, n \rangle$ such that
$\forall_{1 \le i \le k} \; t_i = s_{h(i)}$. This relation between sequences t
and s is written in the form $t \subseteq s$.
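
Definitions 2 and 3 can be expressed as two short predicates. The following
Python sketch is given for illustration only; the function names is_tandem and
is_subsequence are assumptions, and sequences are represented as Python
strings.

```python
def is_tandem(s):
    """True if s has even length and its first half equals its second half."""
    n = len(s)
    return n % 2 == 0 and s[: n // 2] == s[n // 2:]


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting elements (order is kept)."""
    remaining = iter(s)
    # Each `c in remaining` consumes the iterator up to the first occurrence
    # of c, so the positions chosen in s are strictly increasing.
    return all(c in remaining for c in t)
```

Under Definition 2 the empty sequence counts as a tandem sequence, which the
first predicate reflects.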

Definition 4. The Longest Tandem Subsequence problem (LTS for short) for
a given sequence s is the problem of finding a tandem sequence t such that
$t \subseteq s$ and the length of sequence t is the maximum possible.

The suggested approach to the LTS problem reduces LTS for sequence s to
the problem of determining the longest common subsequence of two sequences
not longer than s.

Definition 5. The longest common subsequence $LCS(p, q)$ of sequences p and
q is a sequence t, such that $t \subseteq p$ and $t \subseteq q$, of the
maximum possible length.

The LTS problem for sequence $s = [s_1 s_2 \ldots s_n]$ can be solved by means
of an algorithm consisting of the following two stages.

Algorithm 1. Longest Tandem Subsequence

1. Determine an index l, $1 \le l < n$, for which
$|LCS([s_1 \ldots s_l], [s_{l+1} \ldots s_n])|$ takes the maximum possible
value.

2. Compute sequence $t = LCS([s_1 \ldots s_l], [s_{l+1} \ldots s_n])$ and
return as output.
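
For comparison, the two stages can be sketched in their naive form, in which
Stage 1 simply tries every split index. The Python code below is an
illustration and not the efficient algorithm of this paper: it needs cubic
time, it uses the textbook quadratic-space LCS table instead of Hirschberg's
linear-space method, and the names lcs and lts_naive are assumptions. The
function returns the tandem sequence formed by concatenating the common
subsequence t with itself.

```python
def lcs(p, q):
    """Longest common subsequence of p and q via the textbook DP table."""
    m, k = len(p), len(q)
    # table[a][b] = length of an LCS of p[:a] and q[:b]
    table = [[0] * (k + 1) for _ in range(m + 1)]
    for a in range(1, m + 1):
        for b in range(1, k + 1):
            if p[a - 1] == q[b - 1]:
                table[a][b] = table[a - 1][b - 1] + 1
            else:
                table[a][b] = max(table[a - 1][b], table[a][b - 1])
    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    a, b = m, k
    while a > 0 and b > 0:
        if p[a - 1] == q[b - 1]:
            out.append(p[a - 1])
            a -= 1
            b -= 1
        elif table[a - 1][b] >= table[a][b - 1]:
            a -= 1
        else:
            b -= 1
    return "".join(reversed(out))


def lts_naive(s):
    """Try every split index l and keep the longest LCS of the two parts."""
    best = ""
    for l in range(1, len(s)):  # both parts s[:l] and s[l:] are non-empty
        t = lcs(s[:l], s[l:])
        if len(t) > len(best):
            best = t
    return best + best  # the tandem subsequence is t followed by t
```

Since each of the n - 1 splits costs quadratic time, this sketch reproduces the
cubic running time of the naive approach that the paper improves upon.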

The computational time and memory complexity of Algorithm 1 depends on the
implementation of Stages 1 and 2. Both these stages will be analysed
individually and shown to be solvable in $O(n^2)$ time using $O(n)$ memory.

Stage 2 of Algorithm 1 can be implemented using the approach of Hirschberg
[1], who presented an algorithm which, given two character sequences p and q
($|p| = m$, $|q| = k$), computes the sequence $LCS(p, q)$ in $O(mk)$ time and
requires $O(m + k)$ memory. The strings whose longest common subsequence is
determined in Stage 2 of Algorithm 1 have a total length of n, which closes the
analysis of the complexity of this stage of the algorithm.

An efficient approach to Stage 1 of the algorithm is the subject of
consideration in the following subsection. Since the problem of finding the
index l for sequence s is of some significance and may even in certain
applications be considered separately from the LTS problem (i.e. related to DNA
sequencing, Section 3), it is useful to call it by its own name, referring to
it as the $LTS_{split}$ problem.



2.2 An $O(n^2)$ Time Algorithm for the LTSsplit Problem

Definition 6. LTSsplit is the problem in which, given a sequence $s$ ($|s| = n$),
we have to determine an index $l$, $1 \le l < n$, such that the length of the string
$LCS([s_1 \ldots s_l], [s_{l+1} \ldots s_n])$ is the maximum possible.

The suggested algorithmic solution to the LTSsplit problem is based on dynamic
programming. In order to describe the lengths of the analyzed subsequences, it
is convenient to define the family of functions $f_k$, for $1 \le k \le n$. For a given $k$,
function $f_k : \mathbb{Z} \times \mathbb{Z} \to \mathbb{N}$ is given as follows:

\[ f_k(i,j) := \begin{cases} |LCS([s_1 \ldots s_i], [s_j \ldots s_k])|, & \text{for } 1 \le i < j \le k \\ 0, & \text{in all other cases when } i,j \ge 0 \\ -1, & \text{when } i,j < 0 \end{cases} \qquad (1) \]



Index $l$ may be expressed in terms of the function $f_k$ by the following set of
conditions:

\[ \begin{cases} f_n(l, l+1) = \max_{r:\, 1 \le r < n} f_n(r, r+1) \\ 1 \le l < n \end{cases} \qquad (2) \]

The values of function $f_k(i,j)$, for $1 \le i < j \le k \le n$, can be expressed using a
simple recursive formula:

\[ f_k(i,j) = \begin{cases} \max\{f_k(i-1,j),\; f_{k-1}(i,j)\} & \text{when } s_i \ne s_k \\ f_{k-1}(i-1,j) + 1 & \text{when } s_i = s_k \end{cases} \qquad (3) \]



In order to express values of function $f_k$ using values of function $f_{k-1}$ it is helpful
to introduce the index $\gamma_k(i)$, defined as the largest value of $r$ such that $r \le i$
and $s_r = s_k$, or 0 if no such value exists. From formula (3) we have

\[ f_k(i,j) = \max\{f_{k-1}(\gamma_k(i) - 1, j) + 1,\; f_{k-1}(i,j)\} \qquad (4) \]

Let us now consider the family of functions $d_k : \mathbb{N}^+ \times \mathbb{N} \to \{0, 1\}$, defined for
$1 \le k \le n$ as follows: $d_k(i,j) = f_k(i,j) - f_k(i-1,j)$. For the range of arguments
$1 \le i < j \le k$ the value of function $f_k$ may be expressed as

\[ f_k(i,j) = \sum_{r=1}^{i} d_k(r,j) \qquad (5) \]

A convenient characterization of function $d$ is given by the following property.

Property 1. Let $1 \le i < j \le k$. The following statements hold:

1. Suppose that $d_{k-1}(i,j) = 1$.
   Then $d_k(i,j) = 0$ iff $\forall_{\gamma_k(i-1) \le r < i}\; d_{k-1}(r,j) = 0$.

2. Suppose that $d_{k-1}(i,j) = 0$.
   Then $d_k(i,j) = 1$ iff $s_i = s_k$ and $\exists_{\gamma_k(i-1) \le r < i}\; d_{k-1}(r,j) = 1$.



Proof. Both claims of the property are proven below.

(Claim 1.) Let us assume that $d_{k-1}(i,j) = 1$ and let $d = \sum_{r=\gamma_k(i-1)}^{i-1} d_{k-1}(r,j)$
and $w = f_{k-1}(\gamma_k(i-1) - 1, j)$. By using formulae (5) and (4) we obtain
$f_{k-1}(i-1,j) = w + d$ and $f_k(i-1,j) = \max\{w+1, w+d\} = w + \max\{1, d\}$ respectively.
Moreover, since $f_{k-1}(\gamma_k(i) - 1, j) + 1 \le f_{k-1}(i-1,j) + 1 = f_{k-1}(i,j) = w + d + 1$,
by formula (4) $f_k(i,j) = w + d + 1$. Therefore $f_k(i,j) = f_k(i-1,j)$ iff $d = 0$.

(Claim 2.) First, assume that $d_{k-1}(i,j) = 0$ and $s_i = s_k$. Let $w$ and $d$ be defined
as in the proof of claim 1. Proceeding exactly as before, we get $f_{k-1}(i-1,j) = w + d$
and $f_k(i-1,j) = w + \max\{1, d\}$. By formula (3) $f_k(i,j) = f_{k-1}(i-1,j) + 1 = w + d + 1$.
Therefore $f_k(i,j) = f_k(i-1,j) + 1$ iff $d > 0$.

Let us now assume that $d_{k-1}(i,j) = 0$ and $s_i \ne s_k$. Suppose that $f_k(i,j) =
f_k(i-1,j) + 1$. From formula (3) we have $f_k(i,j) = f_{k-1}(i,j) = f_{k-1}(i-1,j) \le
f_k(i-1,j)$, a contradiction.

At this point it is essential to notice that function $d_k$ has another interesting
property, which is useful in the construction of an efficient algorithm for
LTSsplit.

Property 2. Given $i$ and $k$, $1 \le i < k \le n$, the value of $d_k(i,j)$ is equal to 1 iff
$j \in [i+1, a_k(i)]$, for some function $a_k : \mathbb{N} \to \mathbb{N}$.

For an illustration of the function $a_k$, see Fig. 1.

Fig. 1. Values of functions $f_k$, $d_k$ and $a_k$ for sequence $s$ beginning with BABBCA and
$k = 6$

Proof. For given $i$, the observation that the set $S = \{j : i < j \le k \wedge d_k(i,j) = 1\}$
is a discrete interval whose left end equals $i+1$ is a consequence of the property
of monotonicity. More formally speaking, the proof proceeds by induction with
respect to $k$.

If $k = 1$ then $S$ is empty.

Let $k > 1$ and suppose that the inductive assumption holds for $k - 1$. It
suffices to show that if $d_k(i,j) = 1$, then for an arbitrarily chosen $t$, $i < t < j$,
we have $d_k(i,t) = 1$. We will consider two separate cases.

First, let $d_k(i,j) = 1$ and $d_{k-1}(i,j) = 1$. By claim 1 of Property 1, for some
$r$, $\gamma_k(i-1) \le r < i$, we obtain $d_{k-1}(r,j) = 1$. By the inductive assumption we
conclude that $d_{k-1}(r,t) = 1$ and $d_{k-1}(i,t) = 1$. The equality $d_k(i,t) = 1$ is a
conclusion from claim 1 of Property 1.

Now, suppose $d_k(i,j) = 1$ and $d_{k-1}(i,j) = 0$. By claim 2 of Property 1 we
have $s_i = s_k$ and for some $r$, $\gamma_k(i-1) \le r < i$, we have $d_{k-1}(r,j) = 1$. As in
the previous case, $d_{k-1}(r,t) = 1$. The equality $d_k(i,t) = 1$ is a conclusion from
either claim 1 or claim 2 of Property 1, depending on whether $d_{k-1}(i,t) = 1$ or
$d_{k-1}(i,t) = 0$, respectively.

By definition of function $d_k$, $d_k(i,j) = 0$ when $j < i$, which completes the
proof.

As a direct conclusion from Property 2, the values of function $a_k$ uniquely
determine all values of function $d_k$. It is possible to consider a unique
representation of matrix $D_k = \{d_k(i,j)\}$ of dimension $n \times n$ ($1 \le i, j \le n$) in the form
of the column of numbers $A_k = \{a_k(i)\}$ of dimension $n$ ($1 \le i \le n$).

Theorem 1. There exists an algorithm solving the LTSsplit problem for a given
sequence $s$ of length $n$ in $O(n^2)$ time and using $O(n)$ memory.

Proof. The algorithm for solving the LTSsplit problem consists of the following
steps:

1. For all $k$, $1 \le k \le n$, compute the column $A_k$ by modifying column $A_{k-1}$,
making use of Property 2.

2. Determine an LTSsplit index $l$ from the values of column $A_n$ using the
following equation (directly inferred from the definitions of $A_n$, $d_n$, $f_n$):

\[ f_n(i, i+1) = |\{p : 1 \le p \le i \wedge a_n(p) \ge i+1\}| \qquad (6) \]

using condition (2) to guarantee the suitable choice of $l$.

The linear memory complexity of Steps 1 and 2 of the algorithm is evident. It is
also obvious that Step 2 of the algorithm can be performed in $O(n^2)$ time (in fact,
Step 2 can even be implemented with $O(n)$ running time, yet this is irrelevant to
the proof). It now suffices to present an $O(n^2)$ approach to the problem of finding
the column $A_n$ in Step 1 of the algorithm.

To clarify this step, we will consider a geometrical presentation of column $A_k$
as a set $P_k = \{p_1, \ldots, p_k\}$ of $k$ closed horizontal segments of the plane, where
the segment $p_i$ has vertical coordinate $i$, left horizontal coordinate 0 and right
horizontal coordinate $a_k(i)$. For some $k$, consider a pair of values $i,j$, where
$1 \le i < j \le k$. By definition of column $A_k$ and set $P_k$, the value of $d_k(i,j)$ is 1
iff the point $(i,j)$ belongs to some segment of $P_k$. We define the visible section of
segment $p_i$ at height $r$, $0 < r < i$, as a segment $q \subseteq p_i$ whose projection $\pi_x$ to the
horizontal axis fulfills the condition: $\pi_x(q) = \pi_x(p_i) \setminus \bigcup_{t=r}^{i-1} \pi_x(p_t)$. For the sake
of completeness of the definition, the visible section of any segment at height 0
is assumed to be empty. The following corollary is a direct conclusion resulting
from the analysis of Property 2.

Corollary 1. Given set $P_{k-1}$, the set $P_k$ may be constructed from $P_{k-1}$ by
performing the following transformations:

1. for all $i$, $1 \le i < k$, such that $s_i = s_k$, remove segment $p_i$ from the set and
insert a segment with vertical coordinate $i$ and horizontal coordinates 0 and
$k$.

2. for all $i$, truncate the right part of segment $p_i$ by removing the visible section
of $p_i$ at height $\gamma_k(i-1)$ of $P_{k-1}$ from $p_i$.

An example of the transformation of set $P_{k-1}$ into set $P_k$ is presented in
Fig. 2. Since the operations described in Corollary 1 only modify the right
endpoints of segments from $P_k$, the described procedure may be considered in terms
of introducing appropriate modifications to the column $A_k$. The transformation
from $A_{k-1}$ to $A_k$ can be performed in $O(k)$ time in two sweeps, once to detect
the indices $i$ for which $s_i = s_k$ and update the values as in Step 1 of the
transformation, the other to perform Step 2 of the transformation. Thus the column
$A_n$ can be obtained in $O(n^2)$ operations and the proof is complete.

In order to formalise the adopted approach, a complete implementation of
both steps of the algorithm for LTSsplit is given below. To simplify the code,
the two sweeps corresponding to Steps 1 and 2 of the transformation from $A_{k-1}$
to $A_k$ are performed in slightly modified order, which does not influence the
correctness of the algorithm.



Fig. 2. An illustration of the transformation of set $P_{k-1}$ into set $P_k$ for a sequence
beginning with ABCBBCABABAC ($k = 12$): a) the set $P_{k-1}$; b) the set after Step 1 of
conversion (newly added segments are marked with a bold dashed line; segment
fragments to be truncated are denoted by a dash-dot line); c) the set $P_k$
algorithm LTSsplit (s : array 1..n of character) : integer;
var a, c, k, l, t : integer;
    A := [0, ..., 0] : array 1..n of integer;
begin
  {(*) Compute column A_k for k = 1, 2, ..., n}
  for k in (1, 2, ..., n) do
    for a in (k - 1, k - 2, ..., 1) do
      if s[a] = s[k] then begin
        t := A[a];
        c := a + 1;
        {Perform Step 2 of Corollary 1 using a downward sweep technique}
        while (t < k) and (c < k) do begin
          {Trim the segment corresponding to p_c to length t,
           removing the section of it which is visible at height a}
          (A[c], t) := (min{A[c], t}, max{A[c], t});
          c := c + 1;
        end;
        A[a] := k;
      end;

  {(**) Calculate index l using the column A_n}
  c := 0;
  l := 1;
  for k in (1, 2, ..., n - 1) do begin
    t := 0;
    for a in (1, 2, ..., k) do
      if A[a] > k then t := t + 1;
    if t > c then begin
      c := t;
      l := k;
    end;
  end;
  return l;
end.
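The LTSsplit pseudocode can be transcribed almost line by line into a runnable form. The following Python sketch is such a transcription, not an independent implementation: the array `A` is padded with an unused element at index 0 so that the 1-based indices of the pseudocode carry over unchanged, and the name `lts_split` is hypothetical.

```python
def lts_split(s):
    """Return a 1-based index l, 1 <= l < n, maximising |LCS(s[:l], s[l:])|."""
    n = len(s)
    A = [0] * (n + 1)                    # A[i] stores a_k(i); A[0] is padding

    # (*) Compute column A_k for k = 1, 2, ..., n
    for k in range(1, n + 1):
        for a in range(k - 1, 0, -1):
            if s[a - 1] == s[k - 1]:     # s is 0-based in Python
                t = A[a]
                c = a + 1
                # Trim the segments above height a (Step 2 of Corollary 1)
                while t < k and c < k:
                    A[c], t = min(A[c], t), max(A[c], t)
                    c += 1
                A[a] = k

    # (**) Calculate index l from column A_n using equation (6)
    best_count, best_l = 0, 1
    for k in range(1, n):
        count = sum(1 for a in range(1, k + 1) if A[a] > k)
        if count > best_count:
            best_count, best_l = count, k
    return best_l
```

As written, the second phase takes quadratic time, which is sufficient for the bound of Theorem 1; the memory used beyond the input is the single array `A`.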

2.3 Remarks on the Efficiency of the Algorithm for LTS

The approach to the LTS problem described in Algorithm 1 decomposes it into
the LTSsplit and LCS subproblems, both of which can be solved using $O(n^2)$
algorithms with a low coefficient of proportionality (similar for both algorithms
on most system architectures).

The existence of faster algorithms for the problem appears unlikely, since
no algorithm with $o(n^2)$ complexity is known for LCS in the case of general
sequences. This may formally be stated as the following conjecture.

Conjecture 1. It is believed that the computational complexity of an algorithm
solving the LTS problem is never lower than the complexity of an optimal
algorithm solving LCS for general sequences.
3 Final Remarks

One of the major issues in DNA string matching is the problem of finding
the longest approximate tandem substring of a given DNA sequence. Formally
speaking, a longest approximate single tandem string repeat in sequence $s$ is
defined as a substring $p \subset s$ (a subsequence composed of consecutive elements
of $s$) of maximum possible length, which can be split into two similar substrings
$p_1$, $p_2$ [4, 5]. The criterion of similarity may have varying degrees of complexity.
Typically described criteria include the Hamming distance, the Levenshtein edit
distance (elaborated on in [6]), as well as more complex criteria (expressing the
distance in terms of weights of operations required to convert one sequence to
the other, [2, 8]).

In some applications, the criterion used to describe the similarity of $p_1$ and $p_2$
is the length of $LCS(p_1, p_2)$. Given the sequence $p$, the LTSsplit algorithm can
be applied to find the best point for splitting $p$ so as to maximise $|LCS(p_1, p_2)|$.
In consequence the output of the algorithm solving LTS directly leads to the
answer to the two most relevant problems, namely whether $p$ can be split into
two similar fragments and, if so, what those fragments are.
Abstract. The k-NN classifier (k-NN) is one of the most popular document
categorization methods because of its simplicity and relatively good
performance. However, its precision degrades significantly when ambiguity
arises, that is, when there is more than one candidate category to which a
document may be assigned. To remedy this drawback, we propose a new method, which
incorporates the relationships of object-based thesauri into document
categorization using k-NN. Employing the thesaurus entails structuring
categories into taxonomies, since their structure needs to conform
to that of the thesaurus in order to capture the relationships between them.
By referencing relationships in the thesaurus which correspond to the
structured categories, k-NN can be drastically improved, removing the
ambiguity. In this paper, we first perform document categorization
using k-NN and then employ the relationships to reduce the ambiguity.
Experimental results show that the proposed approach improves the
precision of k-NN by up to 13.86% without compromising its recall.



1 Introduction 

Recently, with the advent of digital libraries containing a huge number of
documents, the importance of document categorization as a solution for effective
retrieval is ever increasing. Document categorization is the task of assigning a
document to an appropriate category in a predefined set of categories.
Traditionally, document categorization has been performed manually. However,
as the number of documents increases explosively, the task is no longer
amenable to manual categorization, which requires a vast amount of time and
cost. This has led to numerous studies on automatic document classification,
including Bayesian classifiers, decision trees, k-nearest neighbor (k-NN)
classifiers, rule learning algorithms, neural networks, fuzzy logic based algorithms
and support vector machines [1][4][5][7][8][9][10][11]. For classification, they
usually create feature vectors from terms frequently occurring in documents and
* This work was supported by Korea Science and Engineering Foundation(KOSEF) 
Grant No. R05-2003-000-11986-0. 



A. Apostolico and M. Melucci (Eds.): SPIRE 2004, LNCS 3246, pp. 101-112, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
then repeatedly refine the vectors through document learning [12]. However,
since they train the classifiers with the vectors only, they usually incur
ambiguity when determining categories, which significantly degrades their precision [6].
For example, suppose there is a document which reviews overall features of
display equipment. Most terms occurring in this document would be
related to the categories LCD, CRT and PDP. Considering document
vectors alone, without considering relationships between categories, such classifiers
fail to capture the fact that LCD, CRT and PDP are all examples of
display equipment, and so suffer from ambiguity between these categories.

To tackle the problem, we propose a new method to improve the precision 
of k-NN by incorporating the relationships of object-based thesauri into docu- 
ment categorization using k-NN. The reason we choose k-NN is that it shows 
relatively good performance in general in spite of its simplicity [2] [12] [13]. Em- 
ploying the thesaurus entails structuring categories into taxonomies, since their 
structure needs to be conformed to that of the thesaurus for capturing relation- 
ships between themselves. By referencing relationships in the thesaurus which 
correspond to the structured categories, k-NN can be improved, removing the 
ambiguity. In this paper, we first perform the document classification by using 
k-NN and then, if a document is to be classified into more than one category, we 
employ the relationships of the thesaurus to reduce the ambiguity. Experimen- 
tal results show that this method enhances the precision of k-NN up to 13.86% 
without compromising its recall. 

This paper proceeds as follows. In Section 2, we review research related to
our classification method. Section 3 describes a way of hierarchical classification
employing the object-based thesaurus. Section 4 shows experimental results, and
conclusions and future research follow in Section 5.

2 Preliminaries 

2.1 Document Classification Based on k-NN 

As in most research work, we use the vector space model for representing a
document. Let $D$ be a set of documents and $T$ be a set of distinct terms that appear
in the set of documents $D$. Each document $d \in D$ is represented by a vector of term weights [12].

To get the weight of a term $t_i \in T$, we use the tfidf weighting scheme that is
generally used. Also, term weights are normalized by cosine normalization [12].
Let $w_{ki}$ be a normalized weight with respect to $d_k \in D$ and $t_i$. Based on this
weighting scheme, the following vector for $d_k$ can be constructed:

\[ term(d_k) = \{t_1/w_{k1},\; t_2/w_{k2},\; \ldots,\; t_{|T|}/w_{k|T|}\} \]
where $|T|$ is the cardinality of $T$.
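The construction of $term(d_k)$ can be sketched as follows. The text does not say which tfidf variant is used, so this sketch assumes the common form tf × log(N/df) followed by cosine (unit length) normalization; the function name `tfidf_vectors` and the input format (one list of tokens per document) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log(N / df).
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Cosine normalization: scale the vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors
```

Under this variant a term that occurs in every document receives weight 0, since it does not help to tell documents apart.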

To classify a document d, k-NN selects its similar neighbors among the train- 
ing documents and uses the categories assigned by the neighbors to judge the 
category of d [2] [5] [13]. To explain how k-NN determines the category of d, we 
provide a definition. 
Definition 1. Let $Near(d)$ be the set of $k$ nearest neighbor documents with
respect to $d$ and $d_j^{Near} \in Near(d)$. Assignment of $d_j^{Near}$ to $c_i \in C = \{c_1, c_2, \ldots, c_m\}$
is denoted by

\[ c_i(d_j^{Near}) = \begin{cases} 1, & \text{if } d_j^{Near} \text{ belongs to } c_i \\ 0, & \text{otherwise} \end{cases} \]

Since each $d_j^{Near}$ is a document similar to $d$, we can estimate the weight $w_{c_i}$
with which $d$ may belong to $c_i$ by calculating $w_{c_i} = \sum_{j=1}^{k} c_i(d_j^{Near})$. The weight
may be viewed as the degree to which $c_i$ is the proper category for $d$.
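The weights of Definition 1 amount to counting how many of the k nearest neighbors carry each category. A minimal sketch, assuming sparse unit-length term vectors (so that the dot product equals cosine similarity) and one category label per training document; the names `cosine` and `knn_category_weights` are hypothetical.

```python
from collections import Counter


def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine similarity
    when both vectors are already normalized to unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_category_weights(d, training, k):
    """training: list of (vector, category) pairs.
    Returns a Counter mapping each category c_i to its weight w_{c_i}."""
    nearest = sorted(training, key=lambda item: cosine(d, item[0]), reverse=True)[:k]
    return Counter(category for _, category in nearest)
```

Plain k-NN then selects the category with the largest count.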

Example 1. Let $k = 11$ and $c_i \in \{c_1, c_2, \ldots, c_5\}$. From k-NN, suppose we obtain
$c_i(d_j^{Near})$, $j = 1, 2, \ldots, 11$ and $w_{c_i}$ as shown in Fig. 1.

Fig. 1. Document classification by k-NN

Since $w_{c_4}$ is the largest value, k-NN simply selects $c_4$ as the target category for
$d$. But what if $d$'s actual category turns out to be $c_1$ after appropriately
exploiting relationships between $c_1$ and $c_5$? In the following section we briefly explore
the object-based thesauri that offer such relationships used in our method.

2.2 Object-Based Thesauri [3] 

The semantic interpretation of the object-based thesaurus may be viewed from
two perspectives: the object perspective and the relationship perspective.

In the object perspective, in contrast to conventional thesauri treating nodes 
as terms, our thesaurus views the nodes as objects taking the terms as their 
names. An object may be an object class or an instance of some object classes. 
It is taken as an instance, if it can be an example of object classes - for example, 
in Fig. 2, “TFT-LCD” is an instance, since it is an example of “LCD.” It may 
take other more specialized objects as its sub-classes or instances, depending on 
whether they can be its examples or not. Since the direct or indirect sub-classes 
of an object class c can, in turn, have their own instances, c may form a class 
hierarchy together with the sub-classes and their instances. 
In the relationship perspective, the object-based thesaurus refines traditional
relationships, BT (Broader Term)/NT (Narrower Term) and RT (Related Term),
into generalization/specialization, composite, and association according to their
semantics. Fig. 2 is an example thesaurus to be used in this paper. To distinguish
object classes from instances, we call the classes concepts. Additionally, when
the objects form a class hierarchy rooted at $c$, we call it a concept hierarchy and
call $c$ its top level concept. Refer to [3] for the detailed description of
this thesaurus including various strategies to effectively construct it.




Fig. 2. Example of the object-based thesaurus 



3 Document Classification with k-NN and Thesauri 

In this section we demonstrate that generalization, composite, association and 
instance relationships in the object-based thesauri may serve to enhance the 
precision of k-NN. 

3.1 Structuring the Set of Categories 

A subset $C$ of concepts or instances in a thesaurus $Th$ is defined as follows.

Definition 2. Let $C = \{c_{i_1 i_2 \cdots i_l} \mid i_j \in I^+,\ l = 1, 2, \ldots, n\}$ where $n$ is the
maximum concept level of $Th$. Then for a category $c_{i_1 i_2 \cdots i_l}$, $c_{i_1}$ is the $i_1$th
top-level concept and $c_{i_1 i_2 \cdots i_l}$ is the $i_l$th sub concept of $c_{i_1 i_2 \cdots i_{l-1}}$ in the level $l - 1$,
$2 \le l \le n$.

If we need to emphasize that $c_{i_1}$ is a top-level concept, we denote it by
$c_{i_1}^{top}$. Since $Th$ also has instances for each $c_{i_1 i_2 \cdots i_l}$, we denote the instance set
by $I(c_{i_1 i_2 \cdots i_l})$. Fig. 3 depicts an example of $C = \{c_1, c_2, c_{11}, \ldots, c_{212}\}$, which is
structured like $Th$.

Since we use concepts in $C$ as categories, from now on we use the terms concept and
category interchangeably. But we don't adopt any instance in $I(c)$ for $c \in C$
as a category, since not only are the instances too specific to be categories but
also their number tends to be large in general. We hence use $\{I(c) \mid c \in C\}$ as a
local dictionary of $c$ characterising $c$.

Fig. 3. Category structure conforming to $Th$

For $d \in D$, a category $c \in C$ has the implication property from generalization
that if $c_{i_1 i_2 i_3}(d) = 1$, then $c_{i_1 i_2}(d) = 1$. This means that if $d$ is included in the
category $c_{i_1 i_2 i_3}$, then $d$ should also be included in its super category $c_{i_1 i_2}$. We
formally define this property in the following definition.

Definition 3. For $c_{i_1 i_2 \cdots i_l} \in C$, if $c_{i_1 i_2 \cdots i_l}(d) = 1$, then $c_{i_1 i_2 \cdots i_{l-1}}(d) = 1$,
$2 \le l \le n$.

The following proposition generalizes this property. 

Proposition 1. For $c_{i_1 i_2 \cdots i_l} \in C$, if $c_{i_1 i_2 \cdots i_l}(d) = 1$, then $c_{i_1 i_2 \cdots i_s}(d) = 1$, where
$s = 1, 2, \ldots, l-1$, $2 \le l \le n$.

Proof. If $c_{i_1 i_2}(d) = 1$, then $c_{i_1}(d) = 1$ for $s = 1$ by Definition 3. Suppose the
following holds as the inductive hypothesis:

If $c_{i_1 i_2 \cdots i_{l-1}}(d) = 1$, then $c_{i_1 i_2 \cdots i_s}(d) = 1$, $s = 1, 2, \ldots, l-2$, $3 \le l \le n$. (1)

Since if $c_{i_1 i_2 \cdots i_l}(d) = 1$, then $c_{i_1 i_2 \cdots i_{l-1}}(d) = 1$ by Definition 3, we can conclude
this proposition holds by using (1).

According to Proposition 1, once a document is assigned to a lower level 
category, our algorithm would automatically assign it to its direct or indirect 
super categories along the corresponding hierarchy. However, this automatic as- 
signment could incur a problem that weights of the higher level categories are 
always larger. This problem is formally specified in Proposition 2. 

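Proposition 2. Let $w_{i_1 i_2 \cdots i_l}$ be a weight of $c_{i_1 i_2 \cdots i_l} \in C$, $1 \le l \le n$. Then the
following holds: if $w_{i_1 i_2 \cdots i_l} > 0$, then $w_{i_1 i_2 \cdots i_s} \ge w_{i_1 i_2 \cdots i_l}$, $s = 1, \ldots, l-1$.

Proof. Let us apply Proposition 1 to $w_{i_1 i_2 \cdots i_l} = \sum_{j=1}^{k} c_{i_1 i_2 \cdots i_l}(d_j^{Near})$. Then since
if $c_{i_1 i_2 \cdots i_l}(d_j^{Near}) = 1$, then $c_{i_1 i_2 \cdots i_s}(d_j^{Near}) = 1$, we can conclude $w_{i_1 i_2 \cdots i_s} \ge
w_{i_1 i_2 \cdots i_l}$ for $s = 1, 2, \ldots, l-1$, $2 \le l \le n$.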
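Propositions 1 and 2 can be illustrated by encoding each category as a tuple of indices, so that $c_{112}$ becomes `(1, 1, 2)` and its super categories are the proper prefixes of that tuple. This encoding and the name `propagate_weights` are assumptions made only for this sketch.

```python
from collections import Counter


def propagate_weights(direct_votes):
    """direct_votes: {category tuple: number of neighbours assigned to it}.
    Every vote for a category also counts for each of its super
    categories (Proposition 1). Returns the resulting weights."""
    weights = Counter()
    for category, votes in direct_votes.items():
        for s in range(1, len(category) + 1):
            weights[category[:s]] += votes   # the category and all its prefixes
    return weights
```

With this propagation the weight of a super category is never smaller than that of any category below it, which is the effect stated in Proposition 2.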

Generally, since assigning documents to the lower level categories is more 
useful classification, in the subsequent section we develop a notion called reduced 
candidate category set not to neglect the categories when assigning. 
3.2 Reducing Ambiguity with Thesaurus

We begin this section by defining a set of candidate categories for $d \in D$.

Definition 4. A set of candidate categories with a predefined threshold value
$\alpha$ is denoted by

\[ C_\alpha(d) = \{c \in C \mid w_c \ge \alpha\} \]

Additionally, we denote the cardinality of $C_\alpha(d)$ by $|C_\alpha(d)|$. $C_\alpha(d)$ is denoted
simply by $C(d)$ when $\alpha$ is unimportant.

For notational convenience, we call $Sup(c)$ the super category set of $c \in C$.

Definition 5. The super category set of $c_{i_1 i_2 \cdots i_l} \in C$ is defined by

\[ Sup(c_{i_1 i_2 \cdots i_l}) = \{c_{i_1 i_2 \cdots i_s} \mid 1 \le s \le l-1\}, \]

$Sup(c_{i_1}^{top}) = \{c_{i_1}^{top}\}$ when $l = 1$.

We now define a reduced candidate category set $C_R(d)$ as the minimal set of
candidate categories for $d$.

Definition 6. A reduced candidate category set is defined by

\[ C_R(d) = C(d) - \bigcup_{l=2}^{n} Sup(c_{i_1 i_2 \cdots i_l}) \quad \text{where } c_{i_1 i_2 \cdots i_l} \in C(d). \]

$C_R(d)$ is the minimal set obtained by curtailing every candidate category in $C(d)$
that is deducible from a category of the lowest level. The minimal property of
$C_R(d)$ can remove some ambiguity to which $C(d)$ may otherwise lead. For
example, the ambiguity incurred from $C(d) = \{c_{11}, c_{111}\}$ may be removed by
curtailing the higher category $c_{11}$ from $C(d)$: $C_R(d) = \{c_{111}\}$. Hence, we refer
to ambiguity only when $|C_R(d)| \ge 2$ from now on.

We first deal with definite category assignment and then develop a way of
resolving the ambiguity appearing in each case by exploiting relationships
available in the object-based thesaurus. Definition 7 shows a way for any $c \in C_R(d)$
to systematically use the related category set $\{c' \mid c' \notin C_R(d)\}$. Each $c' \notin C_R(d)$
may play a crucial role in resolving the ambiguity, though it is not selected as a
candidate category of $d$.

Definition 7. Let $C'(d) = C_{\alpha'}(d) - C_\alpha(d)$ for another threshold value $\alpha'$
satisfying $\alpha > \alpha' > 0$ and call it the second candidate category set for $d$. Then a reduced
second candidate category set is defined by

\[ C'_R(d) = C'(d) - \bigcup_{l=2}^{n} Sup(c_{i_1 i_2 \cdots i_l}) \quad \text{where } c_{i_1 i_2 \cdots i_l} \in C'(d). \]

To develop our algorithm, consider a document $d$, which deals with a topic
about Digital TV, LCD and PDP. The structured category set depicted
in Fig. 4 involves them as categories.

Example 2. Suppose we get $w_{c_{111}} = 4$, $w_{c_{112}} = 4$, $w_{c_{211}} = 1$, with $k = 9$, $\alpha = 4$
and $\alpha' = 1$. Then $C_4(d) = \{c_1, c_{11}, c_{111}, c_{112}\}$, $C_R(d) = \{c_{111}, c_{112}\}$ and $C'_R(d) =
\{c_{211}\}$.
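The sets of Example 2 can be reproduced with a few set operations. In this sketch categories are encoded as index tuples (an assumption: $c_{112}$ becomes `(1, 1, 2)`), a category is taken to be a candidate when its weight is at least the threshold (which is what Example 2 implies), and the weights of the super categories are assumed to follow from Proposition 1. All function names are hypothetical.

```python
def candidate_set(weights, alpha):
    """Categories whose weight reaches the threshold alpha."""
    return {c for c, w in weights.items() if w >= alpha}


def super_categories(c):
    """Proper prefixes of the index tuple c, i.e. its direct and indirect supers."""
    return {c[:s] for s in range(1, len(c))}


def reduced(candidates):
    """Drop every candidate that is a super category of another candidate."""
    removed = set()
    for c in candidates:
        removed |= super_categories(c)
    return candidates - removed


# Weights of Example 2, with super category weights filled in by propagation.
weights = {(1,): 8, (1, 1): 8, (1, 1, 1): 4, (1, 1, 2): 4,
           (2,): 1, (2, 1): 1, (2, 1, 1): 1}
c_alpha = candidate_set(weights, 4)              # C_4(d)
c_r = reduced(c_alpha)                           # C_R(d)
c_second = candidate_set(weights, 1) - c_alpha   # C'(d)
c_second_r = reduced(c_second)                   # C'_R(d)
```

Here `c_r` contains only `(1, 1, 1)` and `(1, 1, 2)`, and `c_second_r` contains only `(2, 1, 1)`, matching the example.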
Fig. 4. Example of the structured category set

Our algorithm tries to select one between LCD ($c_{111}$) and PDP ($c_{112}$),
depending on which one has composite or association relationships with Digital
TV. Though Digital TV does not belong to $C_R(d)$, it may act as a crucial clue
to determine a correct category, due to the fact that Digital TV ($c_{211}$) $\in C'_R(d)$.
We now provide the following definition for elaborating this selection process.

Definition 8. $r(c_i) = comp(c_i) \cup assoc(c_i)$ where $comp(c_i)$ and $assoc(c_i)$ return
the set of categories related to $c_i$ with composite and association relationship
respectively. They are specified in the thesaurus $Th$.

Definition 9. Let $W_{c_i c_j}/comp$ and $W_{c_i c_j}/assoc$ denote weights estimating the
weight of composite and association relationship between $c_j \in r(c_i)$ and $c_i$
respectively.

The following definition is used to adjust the weight of each $c_i \in C_R(d)$ by $r(c_i)$.

Definition 10. Let $c_i \in C_R(d)$ and $c_j \in C'_R(d)$ for $d \in D$. Then the weight of $c_i$
considering $r(c_i)$ is calculated by

\[ w^r_{c_i} = w_{c_i} + \sum_{c_j \in comp(c_i)} W_{c_i c_j}/comp \times w_{c_j} + \sum_{c_j \in assoc(c_i)} W_{c_i c_j}/assoc \times w_{c_j} \]

Example 3. In Example 2, $w^r_{LCD}$ and $w^r_{PDP}$ for LCD, PDP $\in C_R(d)$, exploiting
DigitalTV $\in comp(LCD)$, may be calculated as follows.

$w^r_{LCD} = 4 + W_{LCD,DigitalTV}/comp \times w_{DigitalTV} = 4 + 0.8 \times 1 = 4.8$.
$w^r_{PDP} = 4 + W_{PDP,DigitalTV}/comp \times w_{DigitalTV} = 4 + 0 \times 1 = 4$.

Based on $w^r_{LCD} > w^r_{PDP}$, we may decide $d$ belongs to LCD rather than PDP.
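The adjustment of Definition 10 can be sketched as follows, using the numbers of Example 3. The thesaurus is represented here by two nested dictionaries of relationship strengths, which is an assumed data layout; a missing entry plays the role of a strength of 0, and the name `relation_weight` is hypothetical.

```python
def relation_weight(c, weights, second_reduced, comp, assoc):
    """w^r_c: the k-NN weight of c plus the weighted contributions of the
    categories in the reduced second candidate set that are related to c."""
    total = weights[c]
    for relation in (comp, assoc):
        for cj, strength in relation.get(c, {}).items():
            if cj in second_reduced:
                total += strength * weights[cj]
    return total


# Numbers of Example 3; the 0.8 strength is the one quoted in the example.
weights = {"LCD": 4, "PDP": 4, "DigitalTV": 1}
second_reduced = {"DigitalTV"}
comp = {"LCD": {"DigitalTV": 0.8}}
assoc = {}
w_r_lcd = relation_weight("LCD", weights, second_reduced, comp, assoc)
w_r_pdp = relation_weight("PDP", weights, second_reduced, comp, assoc)
```

This gives 4.8 for LCD and 4 for PDP, so LCD is preferred.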

Unfortunately, if this process fails to select a unique category, we need to use
a local dictionary of $C_R(d)$. In the following definition we introduce a data set
called the local dictionary set, gathered to further differentiate the categories.

Definition 11. A local dictionary set for $c \in C_R(d)$ is defined by $ld(c) =
\{c\} \cup assoc(c) \cup comp(c) \cup sym(c) \cup I(c) \cup sym(I(c))$ where $sym(c)$ is the set of
$c$'s synonyms.

Since this local dictionary set characterizes each associated category $c \in
C_R(d)$, it may be viewed as another feature vector of $c$.

Definition 12. Let $term(d) = \{t_1/w_1, t_2/w_2, \ldots, t_{|T|}/w_{|T|} \mid t_i \in T\}$ be the term
vector for $d$. Then we define $ld(c) \sqcap term(d) = \{t_i/w_i \mid t_i \in ld(c) \cap T\}$.

Once $ld(c) \sqcap term(d)$ is obtained, we can resolve ambiguity among categories
in $C_R(d)$ by computing the weight $w^{ld}_c$.

Definition 13. For $ld(c) \sqcap term(d) = \{t_i/w_i \mid t_i \in ld(c) \cap T\}$, $w^{ld}_c$ is calculated
by

\[ w^{ld}_c = w^r_c + \sum w_i \quad \text{for } \forall\, t_i/w_i \in ld(c) \sqcap term(d). \]
Example 4. In Fig. 4, suppose we get $w_{c_{111}} = 4$, $w_{c_{112}} = 1$, $w_{c_{113}} = 4$, with $k = 9$.
Then $C_R(d) = \{LCD, CRT\}$ with $\alpha = 4$. Since $w^r_{c_{111}} = w^r_{c_{113}}$, our algorithm
would fail to get a unique category. So, local dictionaries are exploited for
distinguishing LCD from CRT. Hence,

ld(LCD) = {LCD} ∪ assoc(LCD) ∪ comp(LCD) ∪ sym(LCD) ∪ I(LCD) ∪ sym(I(LCD))
        = {“LCD,” “Digital TV,” “TFT-LCD,” “Flatron LCD,”
           “liquid crystal display,” “thin film transistor”}.

Similarly, we can get

ld(CRT) = {“CRT,” “Digital Receiver Amp,” “FTM,” “Dynaflat,”
           “cathode ray tube,” “Flat tension,” “DF”}.

If we let

term(d) = {“crt”/0.36, “digital tv”/0.23, “display”/0.33, “flat tension”/0.19, ...,
           “flatron lcd”/0.21, “lcd”/0.64, “monitor”/0.35, “pdp”/0.19, ...,
           “tft-lcd”/0.53, ...}

then

ld(LCD) ⊓ term(d) = {“digital tv”/0.23, “flatron lcd”/0.21, “lcd”/0.64,
                     “tft-lcd”/0.53}.

ld(CRT) ⊓ term(d) = {“crt”/0.36, “flat tension”/0.19}.

We now get $w^{ld}_{LCD} = 4 + 1.61 = 5.61$ and $w^{ld}_{CRT} = 4 + 0.55 = 4.55$.

To ensure that the difference between $w^{ld}_{LCD}$ and $w^{ld}_{CRT}$ is not negligibly small,
we may need a third threshold value $\beta$ denoting a meaningful weight difference.
For example, LCD could be our choice only if $\beta = 0.5$.
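The arithmetic of Example 4 can be checked with a short sketch. It assumes that dictionary entries and document terms are matched case-insensitively (the example pairs “LCD” with “lcd”), so the dictionaries are written in lower case; only the terms that the example lists explicitly are included, and the function name `local_dictionary_weight` is hypothetical.

```python
def local_dictionary_weight(w_r, ld, term_vector):
    """Return (w^ld_c, matched terms): w^r_c plus the weights of the
    document terms that also occur in the local dictionary ld(c)."""
    matched = {t: w for t, w in term_vector.items() if t in ld}
    return w_r + sum(matched.values()), matched


ld_lcd = {"lcd", "digital tv", "tft-lcd", "flatron lcd",
          "liquid crystal display", "thin film transistor"}
ld_crt = {"crt", "digital receiver amp", "ftm", "dynaflat",
          "cathode ray tube", "flat tension", "df"}
term_d = {"crt": 0.36, "digital tv": 0.23, "display": 0.33,
          "flat tension": 0.19, "flatron lcd": 0.21, "lcd": 0.64,
          "monitor": 0.35, "pdp": 0.19, "tft-lcd": 0.53}

w_ld_lcd, matched_lcd = local_dictionary_weight(4, ld_lcd, term_d)
w_ld_crt, matched_crt = local_dictionary_weight(4, ld_crt, term_d)
```

The results are approximately 5.61 for LCD and 4.55 for CRT, a difference of about 1.06.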

If the process still fails to select a unique category, the final alternative is
to assign $d$ to the super category which generalizes the remaining categories. We
provide the following proposition without proof to show a way of identifying the
direct super category, which does not exist in $C_R(d)$.

Proposition 3. Let $|C_R(d)| \ge 2$ and $s$ be its cardinality. Then for $c_{i_1 i_2 \cdots i_l} \in
C_R(d)$, a category which directly generalizes each of them is $c_{i_1 i_2 \cdots i_{l-1}}$. It is
denoted by

\[ Sup_{direct}(\{c_{i_1 i_2 \cdots i_l} \mid c_{i_1 i_2 \cdots i_l} \in C_R(d)\}) = c_{i_1 i_2 \cdots i_{l-1}} \]

In Example 4, if $w^{ld}_{LCD} \approx w^{ld}_{CRT}$, then the alternative category would be monitor
($c_{11}$), which generalizes LCD ($c_{111}$) and CRT ($c_{113}$); $Sup_{direct}(\{c_{111}, c_{113}\}) = c_{11}$.

We are now in a position to propose the final version of our algorithm. 

Algorithm 1 Resolve($C_R(d)$, $C'_R(d)$, $d$, $Th$)

Begin

1. Let $c, c' \in C_R(d)$, $|C_R(d)| \ge 2$ and a predefined threshold value $\beta > 0$.
Compute $w^r_c$ and $w^r_{c'}$, respectively, for all $c'' \in (r(c) \cup r(c')) \wedge c'' \in C'_R(d)$ by
referring to $Th$.

2. If $w^r_c > w^r_{c'}$ for every $c' \in C_R(d)$, $c \ne c'$, then Return($c$)
else $C_R(d) \leftarrow \{c, c' \mid w^r_c = w^r_{c'}\}$.

3. For each $c \in C_R(d)$, calculate $w^{ld}_c = w^r_c + \sum w_i$ for $\forall\, t_i/w_i \in ld(c) \sqcap term(d)$.

4. If $w^{ld}_c - w^{ld}_{c'} \ge \beta$ for every $c' \in C_R(d)$, then Return($c$).

5. If $|C_R(d)| \ge 2$ and $c \leftarrow Sup_{direct}(C_R(d)) \ne \emptyset$, then Return($c$)
else assign $d$ to each $c \in C_R(d)$ simultaneously.

End
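The control flow of Algorithm 1 can be sketched compactly if the weights $w^r_c$ and $w^{ld}_c$ are taken as precomputed inputs. This is one reading of the steps, not the authors' implementation: Step 2 is interpreted as keeping the categories tied at the highest $w^r$, Step 4 compares the best $w^{ld}$ against the runner-up, and the direct super category is looked up in an assumed `parent` dictionary. The name `resolve` and all parameter names are hypothetical.

```python
def resolve(candidates, w_r, w_ld, beta, parent):
    """Return the list of categories to which the document is assigned.

    candidates: the reduced candidate set C_R(d), with at least two members
    w_r, w_ld:  dicts holding the relationship-adjusted and the
                local-dictionary weights of the candidates
    beta:       minimum meaningful weight difference
    parent:     dict mapping a category to its direct super category
    """
    # Step 2: a unique maximum of w^r decides immediately.
    top = max(w_r[c] for c in candidates)
    tied = [c for c in candidates if w_r[c] == top]
    if len(tied) == 1:
        return [tied[0]]

    # Steps 3-4: compare local-dictionary weights of the tied categories.
    ranked = sorted(tied, key=lambda c: w_ld[c], reverse=True)
    if w_ld[ranked[0]] - w_ld[ranked[1]] >= beta:
        return [ranked[0]]

    # Step 5: fall back to a common direct super category, if there is one.
    parents = {parent.get(c) for c in tied}
    if len(parents) == 1 and None not in parents:
        return [parents.pop()]
    return tied   # otherwise assign d to every remaining candidate
```

With the weights of Example 4 and $\beta = 0.5$ this returns LCD; with a larger threshold such as 2 it falls back to the common super category.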



4 Experimental Results 

In this experiment, we collected 427 documents from electronic-product review
directories in the Yahoo Korean Web site^1. We held out 30% of the documents for
testing and used the remaining 70% for training. We used
six top level categories (household appliance, computer, computer peripheral
device, computer component, audio equipment and video equipment), which were in
turn divided into 24 sub categories. The object-based thesaurus we used
contains about 340 terms that were extracted from the data set. Categorization
results with k-NN differ according to parameters such as the number of
neighbors and the threshold values of candidate category sets, so different numbers of
neighbors and threshold values are examined in this experiment. We first
perform modified k-NN on the hierarchically structured categories; if a document is
assigned to a unique category, it is automatically assigned to its super categories
by Proposition 1. Next, the relationships of the thesaurus are employed with
the modified k-NN according to Algorithm 1. Table 1 shows the experimental
results of k-NN, modified k-NN and modified k-NN with the thesaurus (or briefly
k-NN + Thesaurus). The threshold value of the candidate category sets $\alpha$ and
the number of neighbors $k$ are 7 and 17, respectively.

Even when classifying with k-NN, since categories on higher levels tend to have
high weights, in most cases documents are assigned to them if each document
may have only one target category. Therefore, the precision of k-NN becomes
abnormally high at 90.7%, while its recall of 47.9% is considerably low.
The modified k-NN which is allowed to take more than one category can boost up

^1 http://kr.yahoo.com/Computers_and_Internet/Product_Reviews
Table 1. Result of classification with k=17 and α=7

                                 Method             Precision  Recall   F-measure
classification with              k-NN               90.73%     47.94%   62.73%
hierarchical structure           modified k-NN      84.03%     88.14%   86.04%
                                 k-NN + Thesaurus   89.58%     93.04%   91.27%
classification with the          k-NN               72.85%     80.29%   76.39%
lowest level categories          modified k-NN      71.26%     90.51%   79.74%
                                 k-NN + Thesaurus   86.71%     90.51%   88.57%

Table 2. Result of classification with k-NN on the lowest level categories

k-NN     Precision   Recall    F-Measure
k=14     76.82%      84.67%    80.56%
k=15     75.51%      81.02%    78.17%
k=16     74.15%      79.56%    76.76%
k=17     72.85%      80.29%    76.39%
k=18     75.17%      81.75%    78.32%


Table 3. Result of classification with “k-NN + Thesaurus” on the lowest level categories

k-NN + Thesaurus   Precision   Recall    F-Measure
k=14, α=6          87.77%      89.05%    88.41%
k=15, α=6          86.52%      89.05%    87.77%
k=16, α=7          84.89%      86.13%    85.51%
k=17, α=7          86.71%      90.51%    88.57%
k=18, α=7          87.05%      88.32%    87.68%


the recall to 88.14%. The precision of k-NN + Thesaurus is about 5.5 percentage
points higher than that of the modified k-NN; its enhancement in precision is
not prominent in the hierarchical structure. The reason is that the modified
k-NN automatically assigns a document to the super categories of a category c
as well as to c itself; i.e., even when the categorization on lower levels is
incorrect, it remains correct on higher levels. However, if we evaluate the
categorization on the lowest level categories only, k-NN + Thesaurus improves
the precision of k-NN by about 13.86 percentage points. This shows that with the
modified k-NN, documents happen to be assigned to proper categories on higher
levels, but the same is not usually true on lower levels.
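
The F-measure column of Table 1 can be checked against the precision and recall columns. The paper does not state which F-measure it uses; the sketch below assumes the balanced harmonic mean (F1), and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean). Assumed here, not stated in the paper."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs in percent, copied from Table 1.
table1 = {
    "k-NN, hierarchical": (90.73, 47.94),
    "modified k-NN, hierarchical": (84.03, 88.14),
    "k-NN, lowest level": (72.85, 80.29),
    "k-NN + Thesaurus, lowest level": (86.71, 90.51),
}

for method, (p, r) in table1.items():
    print(f"{method}: F = {f_measure(p, r):.2f}%")
```

Under this assumption the computed values agree with the reported F-measures to two decimal places, which also shows how the low recall of plain k-NN pulls its F-measure down despite the high precision.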

For comparison, Table 2 shows the results of k-NN with different numbers of
neighbors, considering the lowest level categories alone.

Table 3 shows the results of k-NN + Thesaurus on the lowest level categories
with different numbers of neighbors and different threshold values.

As shown in Table 2 and Table 3, our method drastically improves the precision
by removing the ambiguity that remained unresolved in k-NN. Fig. 5 clearly shows
the enhancement of the recall and precision listed in Table 3.




Automatic Document Categorization 







Fig. 5. Enhancement of recall and precision 



Note the drastic enhancement of the precision in comparison with that of the
recall, which is due to the effect of resolving ambiguity. Specifically, the
enhancement is most apparent when k=17, i.e., when maximal ambiguity arises.
The reason is that, since the average number of training documents for each
category ranges from 10 to 15, most of them are likely to be among the k
neighbors of a document d whenever it needs to be assigned to some categories.

The F-measure of our method depends wholly on the number of thesaurus terms
used in an experiment. Therefore, by refining the object-based thesaurus with
more terms, we may expect further improved performance. Readers may refer to
[3], which describes a semi-automatic construction technique for easily
creating such sophisticated thesauri.

5 Conclusions and Future Research 

In this paper, we proposed an automatic text classification method to enhance
the classification performance of k-NN with the object-based thesaurus. By
reducing the ambiguity that frequently appears in k-NN, our method drastically
improved the precision of k-NN while preserving its recall. Since the ambiguity
problem is inherent in other automatic document classifiers, we expect that our
method can also be adopted to enhance their performance if appropriately
coupled with them.

As future research, applying our method to semantic web applications would
be more meaningful. For example, labeled hyperlinks encoded in semantic
web documents could provide more fruitful clues for removing the ambiguity.

Abstract. This work provides algorithms and heuristics to index text documents
by determining important topics in the documents. To index text documents,
the work provides algorithms to generate topic candidates, determine
their importance, detect similar and synonym topics, and to eliminate incoherent
topics. The indexing algorithm uses topic frequency to determine the importance
and the existence of the topics. Repeated phrases are topic candidates. For
example, since the phrase ‘index text documents’ occurs three times in this
abstract, the phrase is one of the topics of this abstract. It is shown that this
method is more effective than either a simple word count model or approaches
based on term weighting.



1 Introduction 

One of the key problems in indexing texts by topics is to determine which set of 
words constitutes a topic. This work provides algorithms to identify topics by deter- 
mining which sets of words appear together within a certain proximity and how often 
those words appear together in the texts. 

To count the frequencies of topics in texts accurately, a system must be able to de- 
tect topic repetition, similarity, synonymy, parallelism, and implicit references. 
However, these factors are not all equally important. We have found that topic repeti- 
tion and topic similarity are the most useful and are sufficient to produce good indi- 
ces. 

The work described in this paper provides algorithms to detect similar topics in 
texts. For example, if a text contains the phrase ‘a native American history book’ and 
‘this book is about the history of native Americans’, our system, iindex, detects both 
phrases as similar, counts the frequency of topic ‘native American history book’ as 
two, and makes the phrase a candidate topic. The iindex system also detects topics 
that are synonyms and sums their frequencies to represent the synonyms together as 
one meaning. This is important because the same topic can be expressed in several 
different ways. For example, the phrases ‘topic identification’, ‘topic determination’, 
‘topic discovery’, ‘finding topic’, ‘locating topic’, and ‘topic spotting’ can all serve as 
synonyms. 

Among similar phrases, the iindex system extracts the shorter and better phrases
from texts as topic candidates. For example, the phrase ‘blood pressure’ is
selected over ‘pressure of the blood’. In addition, unlike previous approaches,
such as [WEK01], iindex extracts any important phrases from texts, not just
simple noun phrases. For example, expressions such as ‘high blood pressure has
no symptoms’ and ‘blood pressure should be monitored more frequently’ are
extracted from texts; these expressions would be missed by a noun phrase indexer.

The major contributions of this work are techniques and algorithms to determine
and to order the most important topics in text documents and to index text documents
efficiently based on important topics in the texts without employing linguistic
parsing. The approach efficiently solves the problem of finding important topics
in texts, a problem that requires exponential computation time, by carefully
selecting subsets of the problem that are practical to compute, yet useful as
they cover 97% of the problem domain. The approach also provides a method that
defines topic synonyms with inference complexity O(log n) or better.



2 Background 

Over the past 30 years, a number of approaches to information retrieval have been 
developed, including word-based ranking, link-based ranking, phrase-based indexing, 
concept-based indexing, rule-based indexing, and logical inference-based indexing 
[Ha92, Sa89]. 

The closest work to iindex is that of Johnson [JCD+99] and Aronson [ABC+00];
iindex, however, applies a much richer set of techniques and heuristics than these two
approaches. For example, iindex allows one to configure the maximum number of
words in a phrase, whereas in prior work the phrase size has been fixed (3 in Johnson
and 6 in Aronson). The iindex system also uses limited stemming as opposed to
standard stemming. (We describe both methods and explain the weaknesses of standard
stemming in Section 3.3.) iindex also considers complete documents as its input,
while Aronson uses only the titles and abstracts. Finally, iindex uses a set of
configurable matching techniques, while Johnson uses just one.

Fagan [Fa87] is one of the first to examine the effectiveness of using phrases for
document retrieval. He reports changes in precision ranging from -1.8% to 20.1%.
As in other prior work, his phrase construction is limited to 2-word phrases and
uses standard stemming. Similarly, Kelledy and Smeaton [KS97] report that the use
of phrases improves the precision of information retrieval. They use up to 3-word
phrases and employ standard stemming. They also require that phrases appear in at
least 25 different documents, whereas iindex uses any phrases that are repeated in any
document. Consequently, their approach would miss newly coined phrases that are
repeated in only one document, such as ‘limited stemming’ in this document. Also,
unlike iindex, they do not consider phrase variants such as ‘department of defense’
and ‘defense department’ as equivalent. Mitra et al. [MBSC97] describe a repetition
of the experiments by Fagan with a larger set of about 250,000 documents, limiting
the approach to 2-word phrases that appear in at least 25 documents, employing
standard stemming, and ignoring word order. They conclude that the use of phrases
does not have a significant effect on the precision of the high rank retrieval results,
but is useful for the low rank results.

The work by Wacholder [WEK01] indexes only noun phrases, whereas iindex
considers all types of phrases. Moreover, Wacholder ranks the topics by the fre- 
quency of the head noun alone, whereas iindex ranks the topics by the frequency of 
the whole phrase. 

Woods [Wo97] provides another approach to topic identification, but, unlike
iindex, does not use frequency in determining topic rankings.

3 Indexing by Topic

A topic is a set of words, normally a phrase, that has meaning. Topics are determined
by detecting sets of words that appear together within certain proximity and counting 
how often those words appear together. The more frequent a set of words in the 
document, the better the chance that set of words represents an important concept or 
topic in the document. Generally, the more (significant) words in a topic the more 
specific the topic. Similar topics are grouped (and later stored) by a process that we 
call topic canonization. This process involves converting the words in a phrase to 
their base forms and then ordering the words alphabetically. The resulting phrase is 
called the canonical phrase. We discuss our methods for determining topic length, 
topic proximity, and topic frequency below. 

A sentence or a phrase is a string of characters between topic separators. Topic 
separators are special characters such as period, semicolon, question mark, and ex- 
clamation mark that separate one topic from another. A word is a string of characters 
consisting of only a-z, A-Z, and 0-9. The approach ignores tokens that are numbers, 
hyphens, possessive apostrophes and blank characters. 

The topic length is the number of significant words that constitute a topic
(sentence or phrase). Significant words are those that have not been predefined
as stop words. A stop word is a high-frequency word that has no significant
meaning in a phrase [Sa89]. iindex uses 184 stop words. They are manually
selected as follows: all single characters from a to z, all pronouns, terms
frequently used as variable names such as t1, t2, s1, s2, and words that were
selected manually after evaluating the results of indexing several documents
using iindex.

The maximum and minimum values for topic length are configurable parameters 
of iindex (discussed in Section 4). iindex also provides default settings. The default 
maximum length is 10 and the minimum length is 2. These values were selected be- 
cause it has been reported that the average length of large queries to a major search 
engine (Alta Vista) is 2.3 words [SHMM98]. 

Topic proximity is the maximum distance, in word positions, between the words
that constitute a topic. For example, the phrase ‘a topic must be completely
within a sentence’ is about ‘sentence topic’ and the two words are 6 positions
apart. Thus, for this example, the topic proximity is 6.
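
The distance in the example above can be reproduced with a few lines of Python. This is only a sketch: the helper name `word_distance` is hypothetical, and it looks at the first occurrence of each word, which is enough for this sentence.

```python
def word_distance(sentence: str, first: str, second: str) -> int:
    """Distance in word positions between two words (first occurrences only)."""
    words = sentence.lower().split()
    return abs(words.index(second) - words.index(first))

sentence = "a topic must be completely within a sentence"
print(word_distance(sentence, "topic", "sentence"))  # 6
```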

The topic frequency, or reference count, is the number of times that a topic, similar 
topic, or synonymous topic is repeated in the document. In our approach, the impor- 
tance of a topic is measured by its frequency. A topic is relevant to a unit of a docu- 
ment if the topic is referenced more than once in the unit. A unit of a document can 
be the whole document, a section, or a paragraph. 

3.1 Indexing Algorithm 

The goal of this algorithm is, given a set of documents D, to find a set of w-word
topics that are repeated r times in the documents. The words that constitute a topic 
should not be separated by more than p positions. 

For example, given document D = “abcdbc”, where each letter represents a word,
the list of phrases of any 2 words at most 1 position apart is {ab, bc, cd, db, bc}.
Each phrase has frequency 1, except phrase ‘bc’, which has frequency 2. The phrases
with the highest frequency are the most important topics. In this case, the only
topic is ‘bc’, as a topic must have a frequency of at least 2.
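
The counting in this example can be sketched as follows. The code covers only the special case of 2-word phrases with adjacent words; the names `two_word_phrases`, `counts`, and `topics` are illustrative and not part of iindex.

```python
from collections import Counter

def two_word_phrases(words):
    """All 2-word phrases whose words are adjacent (at most 1 position apart)."""
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

document = list("abcdbc")                  # each letter stands for one word
counts = Counter(two_word_phrases(document))

# A phrase becomes a topic only if it is repeated at least r = 2 times.
topics = {phrase: n for phrase, n in counts.items() if n >= 2}
print(topics)                              # {('b', 'c'): 2}
```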

Let u be a unit of a document d in D. By default, u is the whole document. Let X
be the index of D, which is the set of topics that are repeated at least r times in u.
Each index entry x in X represents a relation between topic t, unit u, and the
frequency f of t in u, and is denoted as x(t, u, f). The index is represented by
X(T, U, F), where T is the set of all topics in D, U is the set of all units in D,
and F is a set of integers. By definition, {x(t, u, f1)} union {x(t, u, f2)} =
{x(t, u, f1+f2)}, i.e., we sum the frequencies of t in u. The frequency of topic t
in unit u is denoted by x(f) for a given index entry x(t, u, f).

Algorithm 1 Indexing Algorithm 

1. For each unit u in d, do the following.

   a. Let Xu be the index of u. Initialize Xu to empty.

   b. Let s be a sentence in u.

   c. Remove stop words and numbers from s. Ignore s if it is one word or less.

   d. For each sentence s in u, do the following.

      i. Generate topic candidates T from s (Section 3.2).

      ii. For each topic t in T, do the following.

          1. Perform limited stemming on t (Algorithm 2).

          2. Perform topic canonization on t.

      iii. Eliminate topics in T that are overlapping in position.

      iv. Merge and sum the frequencies of topics in T that are the same, similar,
          or synonyms, to produce index entry x(t, u, f) and add it into Xu.
          Notice that x(t, u, f1+f2) replaces both x(t, u, f1) and x(t, u, f2) in Xu.

   e. Remove index entries x from Xu that satisfy any of the following conditions:

      i. Topic t consists of fewer than w significant words.

      ii. Topic t contains duplicate words.

      iii. Topic t is a subset of other topics and t is not a stand-alone topic.

   f. For each topic t in Xu, remove extraneous words from t (Algorithm 5).
      Remove t if it is reduced to one word or less.

2. For each document d in D, do the following.

   g. Let Xd be the index of d. Set Xd to the union of Xu from each u in d. In doing
      so, replace u with d in index entry x(t, u, f).

   h. Remove x from Xd if x(f) < r.

3. The index X is the union of Xd and Xu from each u in d and from each d in D.
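
The overall flow of Algorithm 1 can be illustrated with a heavily simplified sketch. It is not the iindex implementation (which is written in C++): it treats the whole text as one unit, uses only adjacent words (p = 1), and omits limited stemming, synonym merging, overlap elimination, and the subset test. The stop-word set and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word subset; the paper's list has 184 entries.
STOP_WORDS = {"a", "the", "of", "is", "to", "and"}

def split_sentences(text):
    """Split on the topic separators named in the paper (. ; ? !)."""
    return [s for s in re.split(r"[.;?!]", text) if s.strip()]

def significant_words(sentence):
    """Lower-cased words minus stop words and pure numbers."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

def candidates(words, max_len):
    """Contiguous phrases of 2..max_len words (proximity p = 1 only)."""
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def build_index(text, max_len=3, min_freq=2):
    counts = Counter()
    for sentence in split_sentences(text):
        words = significant_words(sentence)
        if len(words) < 2:                      # ignore one-word sentences
            continue
        for phrase in candidates(words, max_len):
            if len(set(phrase)) < len(phrase):  # drop topics with duplicate words
                continue
            counts[tuple(sorted(phrase))] += 1  # canonical form: sorted words
    # keep only topics repeated at least min_freq (r) times
    return {topic: f for topic, f in counts.items() if f >= min_freq}

text = ("Blood pressure is high. Check the blood pressure daily. "
        "Pressure of the blood matters.")
print(build_index(text))   # {('blood', 'pressure'): 3}
```

Sorting the words of each candidate is what lets ‘blood pressure’ and ‘pressure of the blood’ fall into the same index entry, so their frequencies are summed.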

3.2 Topic Generation 

Given a sentence of length s, this algorithm generates all possible phrases (topics) of
length 2 to w words with words up to p positions apart. The algorithm systematically
generates all possible phrases as described in the following example.

3.2.1 An Example

Let’s generate all 3-word phrases of at most 3 positions apart from a text document 
“abcde...z”. In this case, each letter represents a word. For a 3-word phrase, there are 
only 2 possible slots inside the phrase as shown in pattern XzXzX, where X represents 
one word and z represents a slot. For each slot z, we may skip 0, 1, or 2 words, i.e. at 
most 3 positions apart. The list of patterns is shown in Table 1. The dash signs in the 
patterns represent words that are skipped. 

Table 1. List of patterns for generating topic candidates

#   Slots   Patterns   Phrases              #Phrases
1   0  0    XXX        abc, bcd, cde, ...   24 = 26-3+1-(0+0)
2   0  1    XX-X       abd, bce, cdf, ...   23 = 26-3+1-(0+1)
3   0  2    XX--X      abe, bcf, cdg, ...   22 = 26-3+1-(0+2)
4   1  0    X-XX       acd, bde, cef, ...   23 = 26-3+1-(1+0)
5   1  1    X-X-X      ace, bdf, ceg, ...   22 = 26-3+1-(1+1)
6   1  2    X-X--X                          21 = 26-3+1-(1+2)
7   2  0    X--XX                           22 = 26-3+1-(2+0)
8   2  1    X--X-X                          21 = 26-3+1-(2+1)
9   2  2    X--X--X                         20 = 26-3+1-(2+2)


The number of patterns is 3^2 = 9. The number of phrases, 24 + 23 + ... + 20 =
198, is less than 9 * 24 = 216, because there are 9 patterns, each of which cannot
generate more than 24 phrases (each phrase contains at least 3 words).
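
The pattern table can be reproduced by enumerating the number of skipped words per slot and sliding each pattern over the text. This is an illustrative sketch; `patterns` and `phrases_for_pattern` are hypothetical names, not functions of iindex.

```python
from itertools import product

def patterns(w, p):
    """One tuple per pattern: words skipped (0..p-1) in each of the w-1 slots."""
    return list(product(range(p), repeat=w - 1))

def phrases_for_pattern(words, skips):
    """All phrases obtained by sliding one pattern over the word list."""
    span = len(skips) + 1 + sum(skips)        # positions covered by the pattern
    result = []
    for start in range(len(words) - span + 1):
        pos, phrase = start, [words[start]]
        for skip in skips:
            pos += skip + 1                   # jump over the skipped words
            phrase.append(words[pos])
        result.append("".join(phrase))
    return result

words = list("abcdefghijklmnopqrstuvwxyz")    # each letter stands for one word
all_patterns = patterns(3, 3)
total = sum(len(phrases_for_pattern(words, s)) for s in all_patterns)

print(len(all_patterns), total)                   # 9 198
print(phrases_for_pattern(words, (0, 1))[:3])     # ['abd', 'bce', 'cdf']
```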

3.2.2 Computational Complexity 

The number of patterns consisting of w words at most p positions apart is $p^{w-1}$.
An upper bound on the number of phrases of w words at most p positions apart
generated from one sentence of length s is $(s - w + 1)\,p^{w-1}$. Thus, the number
of phrases $f(s, w, p)$ does not exceed $(s - w + 1)\,p^{w-1}$. The number of phrases
consisting of 2 to w words is $g(s, w, p) = \sum_{i=2}^{w} f(s, i, p)$.
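
The counts can also be computed exactly without generating any phrase, which is useful for checking the figures in the performance tables. The sketch below assumes a sentence of unique words and counts, for every combination of skipped words, how many start positions fit in the sentence; the function names are illustrative only.

```python
def phrase_count(s: int, w: int, p: int) -> int:
    """Exact f(s, w, p): w-word phrases from a sentence of s unique words."""
    # ways[k] = number of patterns whose w-1 slots skip k words in total
    ways = [1]
    for _ in range(w - 1):
        nxt = [0] * (len(ways) + p - 1)
        for total, n in enumerate(ways):
            for skip in range(p):             # each slot skips 0..p-1 words
                nxt[total + skip] += n
        ways = nxt
    # a pattern that skips k words fits at s - w + 1 - k start positions
    return sum(n * max(0, s - w + 1 - k) for k, n in enumerate(ways))

def g(s: int, w: int, p: int) -> int:
    """Phrases of 2..w words, i.e. the sum of f(s, i, p) for i = 2..w."""
    return sum(phrase_count(s, i, p) for i in range(2, w + 1))

def pattern_count(w: int, p: int) -> int:
    """Patterns for phrases of 2..w words: the sum of p**(i-1) for i = 2..w."""
    return sum(p ** (i - 1) for i in range(2, w + 1))

print(g(15, 3, 3), pattern_count(3, 3))       # 138 12
print(g(124, 10, 3), pattern_count(10, 3))    # 3158934 29523
```

The per-length count never exceeds the bound $(s - w + 1)\,p^{w-1}$, and the bound is reached only when p = 1.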

3.2.3 Computational Performance 

Worst Case 

Table 2 shows the performance of iindex on the worst-case scenario of generating all
possible phrases from one sentence of unique words w1, w2, ..., w124. The value of
s = 124 is the longest sentence found among all text documents evaluated in this
work. The value of w = 10 is the default value set for iindex.

The numbers in the table were computed by iindex. The computer specified in
Section 4 ran out of memory when iindex tried to compute g(124, 10, 3). Therefore,
the computation time for g(124, 10, 3) is an estimate, as indicated by the asterisk.

Average Case 

Although the worst case scenarios are almost impossible to compute, the average
cases can be computed efficiently, as shown in Table 3. The table shows the
performance of generating all possible phrases from one sentence consisting of 15
unique words. The values of s = 15 and w = 3 are based on the average sentence
length and average topic length of all text documents evaluated in this work.

Table 2. The performance of a worst-case scenario

g(s, w, p)       Patterns     Phrases        Minutes
g(124, 10, 1)    9            1,071          0
g(124, 10, 2)    1,022        114,437        14
g(124, 10, 3)    29,523       3,158,934      *386
g(124, 10, 4)    349,524      35,767,926     *4,371
g(124, 10, 5)    2,441,405    238,647,305    *29,161


Table 3. The performance of an average-case scenario

g(s, w, p)       Patterns   Phrases   Milliseconds
g(15, 3, 1)      2          27        30
g(15, 3, 2)      6          75        40
g(15, 3, 3)      12         138       40
g(15, 3, 12)     111        555       90


Best Case 

The best-case scenario is when almost all problem instances are covered in a
reasonable amount of time. In this work, 97% of sentences had 43 words or less and
97% of the topics generated from all the documents had length 6 or less. Based on
those values, the performance of the algorithm is computed as shown in Table 4. The
empirical results show that we can compute g(43, 6, 3) in 7 seconds. That means it
is practical to compute the index of text documents that contain sentences up to 43
words long, topics up to 6 words long, and topic proximities up to 3 positions apart.



Table 4. The performance of the best-case scenario

g(s, w, p)     Patterns   Phrases    Seconds
g(43, 6, 1)    5          200        0
g(43, 6, 2)    62         2,279      0
g(43, 6, 3)    363        12,327     7
g(43, 6, 4)    1,364      42,722     53
g(43, 6, 5)    3,905      112,250    156


With this approach, we efficiently solve the problem of finding important topics in 
texts, a problem that requires exponential computation time, by carefully selecting 
subsets of the problem that are practical to compute, yet cover 97% of the problem. 



3.3 Similar Topic Detection 

Topic t1 is similar to topic t2 if they have the same significant base words. Significant
words are those that are not stop words. Base words are those that have been con- 
verted to their root forms by a process called limited stemming, described below. 
Examples of similar topics are ‘repeated term’, ‘repeated terms’, ‘term repetition’, 
and ‘repetition of terms’. 

Limited stemming is the process of converting word forms to their base forms
(stems, roots) according to a set of conversion rules, F, as part of the simple grammar
G described in Section 3.4. Only those words in F are converted to their base forms,
in addition to the automatic conversion of regular forms as described in the following
algorithm.

Set F includes a list of irregular forms and their corresponding base forms as
defined in the WordNet [Mi96] list of exceptions (adj.exc, adv.exc, noun.exc, verb.exc).
Examples of irregular forms are ‘goes’, ‘went’, and ‘gone’ with base form ‘go’. The
stemming is represented by one rule: go → goes | went | gone.

Word forms that have the same sense in all phrases, but are not included in the
WordNet list of exceptions, are manually added to F. Examples of such word forms
are ‘repetition’ with base form ‘repeat’ and the word ‘significance’ with base
‘significant’.

Algorithm 2 Limited Stemming Algorithm 

This algorithm returns the base form of a given word w or null. 

1. If word w is defined in F then return its base form.

2. Else

   a. If one of the suffixes ‘s’, ‘ed’, or ‘ing’ exists at the end of word w then
      truncate the suffix from w to produce w’.

   b. If the length of w’ is at least 2 then return w’.

   c. Return null.
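
Algorithm 2 translates almost directly into code. The sketch below is not the iindex implementation (which is written in C++); the dictionary `F` holds only the handful of rules quoted in the text, and `limited_stem` is a hypothetical name. `None` plays the role of null.

```python
# Hypothetical excerpt of rule set F: word form -> base form.
F = {
    "goes": "go", "went": "go", "gone": "go",
    "repetition": "repeat", "significance": "significant",
}

def limited_stem(word, rules=F):
    """Return the base form of `word` following Algorithm 2, or None."""
    if word in rules:                       # step 1: explicit rule in F
        return rules[word]
    for suffix in ("s", "ed", "ing"):       # step 2a: regular suffixes
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # steps 2b and 2c: accept the stem only if it has 2+ characters
            return stem if len(stem) >= 2 else None
    return None                             # no rule and no suffix applies

for w in ("went", "terms", "repeated", "indexing", "importance"):
    print(w, "->", limited_stem(w))
```

Note that ‘importance’ yields `None` rather than ‘import’, which is the behavior the limited approach is designed to give. How a caller treats `None` (for example, keeping the word unchanged) is not specified in the text and would be a design choice.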

The limited stemming algorithm above has been developed to avoid some of the 
errors that arise when a standard stemming algorithm (such as described in [Sa89]) 
predicts that two words have the same meaning when they do not [Ha92, Fa87]. For
example, the word ‘importance’ should not be stemmed to ‘import’ because the two 
words are semantically unrelated. 

As mentioned above, stop words and word order are ignored when determining 
topics. When these ideas are combined with limited stemming, the following phrases 
are detected as similar: ‘repeated terms’, ‘repeated term’, ‘term repetition’, ‘repetition 
of terms’. This heuristic will not always work. For example, it will never be able to 
distinguish between ‘absence of evidence’ and ‘evidence of absence’. However, we 
have found very few cases of this sort. 

Algorithm 3 Similar Topic Detection 

The following algorithm determines if topic t1 is similar to topic t2.

1. Remove stop words from t1 and t2.

2. Perform limited stemming on t1 and t2.

3. Order words in t1 alphabetically.

4. Order words in t2 alphabetically.

5. Return true if t1 is identical to t2.
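
The five steps of Algorithm 3 can be sketched as below. The stop-word set and the rule table are small hypothetical stand-ins for the files used by iindex, and `base_form` keeps a word unchanged when limited stemming yields no result, which is an assumption of this sketch.

```python
STOP_WORDS = {"a", "of", "the"}            # hypothetical subset
RULES = {"repetition": "repeat"}           # hypothetical excerpt of F

def base_form(word):
    """Limited stemming; the word is kept as-is when no rule applies."""
    if word in RULES:
        return RULES[word]
    for suffix in ("s", "ed", "ing"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= 2:
            return stem
    return word

def canonical(topic):
    """Steps 1-4: drop stop words, stem, and order alphabetically."""
    words = [w for w in topic.lower().split() if w not in STOP_WORDS]
    return tuple(sorted(base_form(w) for w in words))

def similar(t1, t2):
    """Step 5: topics are similar if their canonical phrases are identical."""
    return canonical(t1) == canonical(t2)

print(similar("repeated terms", "repetition of terms"))        # True
print(similar("absence of evidence", "evidence of absence"))   # True (known limitation)
```

The second call shows the limitation mentioned in the text: because word order is ignored, the two phrases cannot be told apart.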

3.4 Synonymous Topic Detection 

Phrases that have the same meaning are called phrase synonyms or topic synonyms. In
addition to topic canonization, phrase synonyms can be defined explicitly by adding
production rules, S, to the simple grammar G defined below. For example, the
following production rule specifies that phrases ‘topic identification’, ‘determine
topics’, ‘discover topics’, and ‘topic spotting’ are synonyms: topic identification →
determine topics | discover topics | topic spotting.

The rules in S are manually constructed to improve the quality of the index. How- 
ever, the iindex produces good indices without defining any rules in S. 

Phrase synonyms share one meaning called the synonym meaning, which is repre- 
sented by the string at the head of the production rule. In the above example, the 
synonym meaning is string ‘topic identification’. Each phrase (node) in the produc- 
tion rule represents a set of similar phrases. 

Topic t1 is synonymous to topic t2 if and only if the synonym meaning of t1 is
literally the same as the synonym meaning of t2.

A simple grammar, G, is used to represent both stems for words and synonyms for
topics. It is called a simple grammar because it can be implemented with a simple
lookup table with logarithmic complexity O(log n), where n is the number of entries
in the table (the same as the number of terms in the production rules). The grammar
could be implemented with constant complexity O(1) using hashing.

There are 4519 rules defined in the current implementation of iindex. The rules de- 
fine 11452 mappings of one string to another. 

Algorithm 4 Synonymous Topic Detection 

1. Remove stop words from t1 and t2.

2. Convert topics t1 and t2 to their canonical phrases.

3. Let g1 be the set of synonym rules with t1. Let g2 be the set of synonym rules
with t2. (Both g1 and g2 are subsets of the simple grammar G.)

4. If the intersection of g1 and g2 is not empty, then t1 and t2 are synonyms; otherwise
they are not.
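
A hash-based lookup table, as suggested above for constant-time access, could look like the following sketch. The rule is the one quoted in the text; the canonization is deliberately simplified (it only strips a plural ‘s’ instead of applying full limited stemming), and all names are illustrative.

```python
STOP_WORDS = {"of", "the", "a"}            # hypothetical subset

# Synonym rules S: head (the synonym meaning) -> alternative phrases.
RULES = {
    "topic identification": ["determine topics", "discover topics", "topic spotting"],
}

def canonical(phrase):
    """Simplified canonization: drop stop words, strip a plural 's', sort."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    words = [w[:-1] if w.endswith("s") and len(w) > 2 else w for w in words]
    return tuple(sorted(words))

# Hash table: canonical phrase -> set of rule heads (synonym meanings).
MEANINGS = {}
for head, alternatives in RULES.items():
    for phrase in [head, *alternatives]:
        MEANINGS.setdefault(canonical(phrase), set()).add(head)

def synonymous(t1, t2):
    """Algorithm 4: synonyms if the rule sets of t1 and t2 intersect."""
    g1 = MEANINGS.get(canonical(t1), set())
    g2 = MEANINGS.get(canonical(t2), set())
    return bool(g1 & g2)

print(synonymous("topic spotting", "discover topics"))   # True
print(synonymous("topic spotting", "blood pressure"))    # False
```

Because each dictionary lookup takes expected constant time, the check does not depend on the number of rules.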

3.5 Topic Elimination 

The iindex generates some incoherent phrases, such as ‘algorithm for determining’ 
and ‘automatic indexing involves’, during the indexing process. Those phrases need 
to be removed from the index. 

Topics that contain duplicate words are also removed because we have found that 
they are mostly incoherent. An example phrase with duplicate words is ‘string the 
string’. The iindex generates the phrase from [Ka96] because the phrase is repeated 
twice (ignoring stop words) as follows. 

“. . . denotes the empty string, the string containing no elements ...” 

“. . . machine has accepted the string or that the string belongs ...” 

3.5.1 Remove Extra Words from Topics 

This section describes heuristics to remove some incoherent phrases or to transform 
them into coherent phrases. 

Define B as the set of words and phrases to be eliminated from the beginning of
topics and E as the set from the end of the topics. S is the set of stop words. Sets B
and E are constructed manually. Examples are B = {according to, based on,
following, mentioned in} and E = {using, the following, involves, for combining,
for determining, to make}.

Algorithm 5 Removal of Extraneous Words from Topics

1. Remove consecutive words or phrases from the beginning of topic t if they are in 
B or S. 

2. Remove consecutive words or phrases from the end of t if they are in E or S. 

3. If t is reduced to one word or less then do not use t, otherwise use t. 
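
The three steps of Algorithm 5 can be sketched as follows, using the example sets B and E given in the text and a small hypothetical stop-word set S. The helper names are illustrative; `None` stands for a topic that is discarded.

```python
B = [("according", "to"), ("based", "on"), ("following",), ("mentioned", "in")]
E = [("using",), ("the", "following"), ("involves",), ("for", "combining"),
     ("for", "determining"), ("to", "make")]
S = {"the", "a", "of", "for", "to"}        # hypothetical stop-word subset

def _strip(words, phrases, front):
    """Repeatedly remove stop words or listed phrases from one end."""
    words = list(words)
    while words:
        edge = words[0] if front else words[-1]
        if edge in S:
            words = words[1:] if front else words[:-1]
            continue
        for ph in phrases:
            n = len(ph)
            chunk = words[:n] if front else words[-n:]
            if tuple(chunk) == ph:
                words = words[n:] if front else words[:-n]
                break
        else:
            break                          # nothing matched at this end: done
    return words

def clean_topic(topic):
    words = _strip(topic.split(), B, front=True)     # step 1
    words = _strip(words, E, front=False)            # step 2
    return " ".join(words) if len(words) >= 2 else None   # step 3

print(clean_topic("automatic indexing involves"))    # automatic indexing
print(clean_topic("algorithm for determining"))      # None
```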

4 Implementation 

The iindex system has been written in C++. Experiments were performed on a laptop 
with the following hardware and software: Pentium 4, 2 GHz, Microsoft Windows 
2000 Professional, 768 MB memory, and 37 GB hard drive. 

The inputs to iindex are plain text documents in ASCII format. The limited
stemming is defined in a file forms.txt, topic synonyms in rules.txt, stop words in
stopWords.txt, and topic separators in topicSeparator.txt. Parameters with default
values such as s = 50, w = 10, p = 1, r = 2 are configurable in param.txt, where s is
the maximum length of sentences, w is the maximum length of topics, p is the proximity
of topics, and r is the minimum phrase frequency needed to be considered a topic.



5 Results and Evaluation 

The iindex correctly and efficiently finds the most important topics in various types 
and lengths of text documents, from individual sentences and paragraphs to short 
papers, extended papers, training manuals, and PhD dissertations. Titles and abstracts 
were not marked in any special way and thus are not known to iindex. The topics 
extracted from texts are ordered by their importance (topic frequencies). 

The iindex finds 477 topics in [Wi98], a training manual, as shown in Table 5. (N 
= sequence number, TF = topic frequency, WF = word frequency average). It cor- 
rectly extracts the topic ‘blood pressure measurement’ as the third most important 
topic, the topic mentioned in the title of the text. It is indeed true that the text is about 
blood pressure, high blood pressure, and blood pressure measurement as suggested by 
the first 3 most important topics. 



Table 5. List of important topics in blood pressure measurement manual

N    TF     WF     Topics
1    250    306    blood pressure
2    56     227    high blood pressure
3    46     217    blood pressure measurement
4    19     38     american heart association
5    19     157    blood vessels


The iindex finds 42 topics in [Ka96], a short paper. It correctly extracts the topic
‘finite state technology’ as the second most important topic, which is exactly the title
of the paper. It is indeed true that the paper is about finite state, finite state
technology, and regular language as suggested by the first 3 most important topics.

The iindex finds 2172 topics in [Wo97], an extended paper. It correctly extracts
the topic ‘conceptual indexing’ as the most important topic, which is exactly the title 
of the paper. It is indeed true that the text is about conceptual indexing, conceptual 
taxonomy, and retrieval system as suggested by the first 3 most important topics. 

The iindex finds 2413 topics in [Li97], a PhD thesis. It correctly extracts the 
phrase ‘topic identification’ as the second most important topic, the topic mentioned 
in the title. It is indeed true that the text is about topic signatures, topic identification, 
precision and recall as suggested by the first 3 most important topics. 

5.1 Speed of Indexing 

Overall, the iindex is very effective and very efficient in finding the most important 
topics in text documents. It takes 34 seconds to index a 100-page (46145-word) text 
[Wo97]. It takes only 3 seconds to find 482 important topics among 23166 possible 
phrases in the [Wi98] training manual and less than 1 second to find 43 important 
topics among 5017 possible phrases in [Ka96]. 



5.2 Comparisons to the Word Count Model 

The word count model ranks the topics based on the word frequency average listed in
column WF of Table 6. The word count model ranks the topic ‘blood pressure cuff’
extremely high (2nd), although this topic is mentioned just 2 times in [Wi98]. It
ranks this topic higher than the topic ‘blood pressure measurement’, a topic that is
mentioned 46 times. It is unlikely that topic ‘blood pressure cuff’ is more important
than topic ‘blood pressure measurement’ in the document. On the other hand, the
iindex correctly infers that topic ‘blood pressure measurement’ is much more
important (3rd) than topic ‘blood pressure cuff’ (262nd) in the document, as shown in
Table 5. The iindex thus determines the importance of topics in this document more
accurately than the word count model does.
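
The difference between the two orderings can be illustrated with the rows published in Tables 5 and 6. The sketch sorts only these eight rows, so the positions it prints are relative to this small sample and not to the full list of topics found in [Wi98].

```python
# (topic, TF, WF) rows copied from Tables 5 and 6.
rows = [
    ("blood pressure", 250, 306),
    ("high blood pressure", 56, 227),
    ("blood pressure measurement", 46, 217),
    ("american heart association", 19, 38),
    ("blood vessels", 19, 157),
    ("blood pressure cuff", 2, 242),
    ("blood pressure to clients", 8, 229),
    ("elevated blood pressure", 17, 221),
]

# Rank by topic frequency (iindex) and by word frequency average (word count model).
by_tf = [t for t, tf, wf in sorted(rows, key=lambda r: r[1], reverse=True)]
by_wf = [t for t, tf, wf in sorted(rows, key=lambda r: r[2], reverse=True)]

print(by_tf[:3])
print(by_wf[:3])
```

Sorting by WF places ‘blood pressure cuff’ directly after ‘blood pressure’, while sorting by TF places it last in this sample.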

Table 6. List of important topics in blood pressure measurement manual by word count
average order

N    TF     WF     Topics
1    250    306    blood pressure
2    2      242    blood pressure cuff
3    8      229    blood pressure to clients
4    56     227    high blood pressure
5    17     221    elevated blood pressure


5.3 Comparisons to the TFIDF / Term Weighting Model 

According to Salton and Buckley [SB88], who evaluated 287 different combinations of
term-weighting models, the best term weighting model is tfidf. However, tfidf fails
to find the most important topic ‘voting power’ in a Wall Street Journal text, according
to [Li97, page 73], while the iindex correctly finds it, as shown in Table 7.
The iindex finds more specific and meaningful topics such as voting power, million
shares, and eastern labor costs, while tfidf finds less specific topics such as Lorenzo,
holder, voting, proposal, etc. The iindex is thus better at identifying important and
specific topics than tfidf.



Table 7. List of important topics in Wall Street Journal text identified by tfidf and iindex 



        tfidf                  iindex
Rank    Term        Weight     Phrase                 Frequency
1       Lorenzo     19.90      voting power           2
2       holder       9.66      million shares         2
3       voting       9.05      eastern labor costs    2
4       proposal     8.03
5       50.7%        7.61
...
16      power        5.01

6 Conclusions 

This paper presents iindex, an effective and efficient approach to indexing text 
documents based on topic identification. A topic is any meaningful set of words that 
is repeated at least twice in the texts. The determination of topics is based on the 
repetition of the words that appear together within texts. To measure topic frequen- 
cies in texts more accurately, iindex detects topics that are similar or synonymous. It 
is also highly configurable. 

iindex allows users to configure the length of phrases, the maximum gap between
words in a phrase, the maximum sentence length, the sets of words to be considered
as synonyms, the stems of irregular words, the set of stop words, the set of topic
separators, and the minimum phrase frequency for topics. iindex also provides useful
defaults for these values; for example, by choosing a sentence maximum of 50 words,
a phrase length of 10, and a word proximity of 1, it can produce a good index of a
100-page (about 46145-word) text in 34 seconds.
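
The notion of a topic as a set of words repeated at least twice can be illustrated with a
minimal sketch. It is not the iindex implementation: it only counts contiguous phrases,
treats stop words as phrase separators, and ignores word gaps, stemming and synonyms.
The function name, the stop word list and the sample sentences are assumptions made
for illustration.

```python
from collections import Counter

# Assumed, illustrative stop word list; iindex lets the user configure this set.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def repeated_phrases(sentences, max_len=3, min_freq=2):
    """Count contiguous phrases of 2..max_len words and keep the repeated ones."""
    counts = Counter()
    for sentence in sentences:
        run = []
        # Stop words end the current run of words; the trailing None flushes the last run.
        for word in sentence.lower().split() + [None]:
            if word is None or word in STOP_WORDS:
                for n in range(2, max_len + 1):
                    for i in range(len(run) - n + 1):
                        counts[" ".join(run[i:i + n])] += 1
                run = []
            else:
                run.append(word)
    return {phrase: c for phrase, c in counts.items() if c >= min_freq}

topics = repeated_phrases([
    "the blood pressure measurement is taken",
    "a blood pressure measurement and a blood pressure cuff",
])
print(topics)
```

With these two sentences, ‘blood pressure cuff’ occurs only once and is therefore not
reported as a topic, while ‘blood pressure measurement’ is.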
The organization of large text collections is the main goal of automated text
categorization. In particular, the final aim is to classify documents into a certain
number of pre-defined categories efficiently and as accurately as possible.
On-line and run-time services, such as personalization services and
information filtering services, have increased the importance of effective and
efficient document categorization techniques. In recent years, a wide range of
supervised learning algorithms has been applied to this problem [1]. Recently,
a new approach that exploits a two-dimensional summarization of the data for
text classification was presented [2]. This method does not go through a word
selection phase; instead, it uses the whole dictionary to present data in an intuitive
way on two-dimensional graphs. Although successful in terms of classification
effectiveness and efficiency (as recently shown in [3]), this method leaves
some key issues unsolved: the design of the training algorithm seems to be ad
hoc for the Reuters-21578¹ collection; the evaluation has been done only on
the 10 most frequent classes of the Reuters-21578 dataset; the evaluation lacks
measures of significance in most parts; and the method adopted lacks a mathematical
justification. We focus on the first three aspects, leaving the fourth as future
work.

The definitions and the experimental setup of [3] were adopted in this work.
The baseline was the support vector machines (SVM) learning method using the
SVMlight implementation². The Focused Angular Region (FAR) algorithm [3]
was compared with SVM. For the experimental evaluation, we added two more
datasets to the above-mentioned Reuters-21578 (see the details in [3]). The first is
20Newsgroups³, which contains about 20,000 articles evenly divided among
20 UseNet discussion groups. We randomly divided the collection into two subsets:
70% was used to train the classifier and the remainder to test the performance.
The second is the new RCV1⁴ Reuters corpus. We focused here on the 21
non-empty sub-categories of the main category named GCAT. We trained on
the first month (Aug 20 1996, Sept 19 1996) with 23,114 documents, and tested
on the last month (Jul 20 1997, Aug 19 1997) with 19,676 documents. Standard

¹ http://www.daviddlewis.com/resources/testcollections/reuters21578/

² http://svmlight.joachims.org/

³ http://www.ai.mit.edu/~jrennie/20Newsgroups/

⁴ http://about.reuters.com/researchandstandards/corpus/
Table 1. Upper half: F1 macro- and micro-averaged comparison between SVM and
FAR algorithm together with training times. Lower half: sign test (ST) and signed rank
test (SRT) results. “>” means 0.01 < P-value < 0.05. “~” means P-value > 0.05. The
last row reports the P-value

                           Reuters-21578      20Newsgroups       RCV1
                           SVM - FAR          SVM - FAR          SVM - FAR
F1-macro                   .866 - .801        .685 - .623        .754 - .701
F1-micro                   .923 - .868        .687 - .606        .577 - .552
Training time (seconds)    16.09 - 4.29       813.01 - 14.88     439.76 - 24.57

                           ST - SRT           ST - SRT           ST - SRT
m0 = .02                   ~ ~                > - >              ~ ~
m0 = .03                   ~ ~                ~ ~                ~ ~
m0 = .04 (P-value)         (.377) - (.492)    (.252) - (.277)    (.668) - (.892)


IR evaluation measures have been computed. Recall $\rho_i$ and precision $\pi_i$ were
calculated for each category $c_i$, together with micro- and macro-averaged estimates
over the collections. The $F_1$ measure was calculated for each category, as well
as the overall $F_1$ macro- and micro-averaged measures (see definitions in [1]). A
controlled study using two statistical significance tests was made to compare the
two classification methods: the sign test (ST) and the signed rank test (SRT)
(see [4]). The paired $F_1$ measures for individual categories and the magnitude of
the differences between paired observations as well as their signs were used. For
the ST, the null hypothesis was $H_0 : m < m_0$, where $m$ is the simple average of
the differences between paired $F_1$ values, while for the SRT the null hypothesis was $H_0$:
the distribution of the differences is symmetric with respect to $m_0$.
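
As an illustration of how such a test can be computed, the following is a minimal sketch
of the textbook one-sided sign test applied to paired per-category $F_1$ values shifted
by $m_0$. The exact test statistic used in the experiments is not given here, so the
function name, the handling of ties and the sample values are assumptions.

```python
from math import comb

def sign_test_p_value(f1_a, f1_b, m0):
    """One-sided sign test on the paired differences (a - b) shifted by m0.

    Returns P(X >= k) for X ~ Binomial(n, 1/2), where k is the number of
    pairs whose difference exceeds m0 and n is the number of non-tied pairs.
    """
    diffs = [a - b - m0 for a, b in zip(f1_a, f1_b)]
    diffs = [d for d in diffs if d != 0]   # ties are dropped (an assumption)
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical per-category F1 values, not taken from the experiments.
f1_svm = [0.90, 0.80, 0.75, 0.60]
f1_far = [0.85, 0.79, 0.70, 0.61]
p = sign_test_p_value(f1_svm, f1_far, m0=0.04)
print(p)
```

A large P-value means that the data give no reason to believe that the first method
exceeds the second by more than $m_0$ on the typical category.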

The final results are reported in Tab. 1. The FAR algorithm proves to be
effective and efficient on different collections. Its training time is lower than that
of SVM on all three collections, by roughly a factor of 4 on Reuters-21578 and by
more than one order of magnitude on 20Newsgroups and RCV1 (there are also some
cross-validation aspects in favor of FAR given in [3]), and the difference in performance
with respect to the baseline appears to be constant. The significance tests indicate
that, on average, the FAR algorithm performs no more than four percentage points
worse than SVM. It is worth noting that the P-values in the last row of Tab. 1 are
well above 0.05, so $H_0$ is not rejected when $m_0 = .04$.
Abstract. In this paper we study the benefit of integrating an overlapping
clustering approach, rather than traditional hard-clustering ones,
in the context of dimensionality reduction of the description space for
document classification.

The Distributional Divisive Overlapping Clustering (DDOC) method is
briefly presented and compared to Agglomerative Distributional Clustering
(ADC) [2] and Information-Theoretical Divisive Clustering (ITDC) [3]
on the two corpora Reuters-21578 and 20Newsgroup.

1 Introduction 

Document classification is usually based on word distributions in a collection
of documents. However, the size of the vocabulary leads to a very large
description space, which can be reduced in different ways: word selection,
re-parameterisation or word clustering. The last method aims at indexing the
documents with clusters of words which present similar distributions over the
class labels, $p(c|w)$. The two main algorithms are Agglomerative Distributional
Clustering (ADC) [2] and Information-Theoretical Divisive Clustering
(ITDC) [3].

Rather than build disjoint clusters, we propose here to produce overlapping 
clusters of words. We claim that “soft-clusters” match better with the natural 
non-exclusive membership of words to semantic concepts. 

2 The DDOC Method 

The Distributional Divisive Overlapping Clustering (DDOC) method is inspired
by the clustering algorithm PoBOC [1]. This algorithm has three main advantages:
first, it produces overlapping clusters; second, the number of clusters is
not given as a parameter; and finally, it only requires a similarity matrix over the
dataset. Nevertheless, PoBOC is not suitable for very large databases (VLDB),
so a traditional sampling is applied. An overview of the DDOC method is
given in Figure 1.

First experiments aim at observing the power of indexing with overlapping word
clusters combined with a Bayesian classifier. Figure 2 presents the results
obtained on the Reuters corpus. Experiments on the Newsgroup dataset lead to
almost identical results.
Input: The vocabulary V, a similarity matrix S, two parameters M and t.
Output: A set of overlapping word clusters W = {W_1, ..., W_k}.

1. Select the M words with the highest mutual information with the class
   variable.

2. Apply PoBOC over this set of M words, resulting in k overlapping clusters.

3. Assign the other words to these k clusters with a multi-assignment
   heuristic (cf. PoBOC).

4. Iterate a reallocation stage from each word to one or several clusters until
   no change is observed or t iterations are achieved.

Fig. 1. The DDOC algorithm.
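
Step 1 of the algorithm can be sketched as follows. This is a minimal illustration that
assumes mutual information is estimated from the presence or absence of a word in each
labelled document; the estimator actually used by DDOC is not specified here, and the
function names and toy documents are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(docs, word):
    """I(W; C) in bits, where W is the presence/absence of `word` in a document."""
    n = len(docs)
    joint = Counter((word in words, label) for words, label in docs)
    count_w, count_c = Counter(), Counter()
    for (w, c), cnt in joint.items():
        count_w[w] += cnt
        count_c[c] += cnt
    # Sum over observed (w, c) pairs of p(w,c) * log2(p(w,c) / (p(w) p(c))).
    return sum((cnt / n) * log2(cnt * n / (count_w[w] * count_c[c]))
               for (w, c), cnt in joint.items())

def select_top_words(docs, m):
    """Return the m vocabulary words with the highest mutual information."""
    vocab = sorted(set().union(*(words for words, _ in docs)))
    return sorted(vocab, key=lambda w: -mutual_information(docs, w))[:m]

# Toy labelled documents (sets of words), purely illustrative.
docs = [
    ({"goal", "team"}, "sport"),
    ({"goal", "match"}, "sport"),
    ({"vote", "team"}, "politics"),
    ({"vote", "law"}, "politics"),
]
selected = select_top_words(docs, 2)
print(selected)
```

In this toy collection ‘goal’ and ‘vote’ each determine the class exactly, whereas ‘team’
occurs equally often in both classes and carries no information about the class.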




[Figure 2 plots omitted; axis labels: % Overlapping, Number of clusters]

Fig. 2. (left) Classification Accuracy w.r.t. importance of overlaps with DDOC. (right)
Classification Accuracy: comparison between ADC, ITDC and DDOC.



3 Conclusion 

Empirical evaluations of the DDOC method suggest that the overlaps
between word clusters can help to index the documents better, inducing a
slightly better classifier than the ADC or ITDC algorithms. Future work will
concern a more formal study in the context of information theory and the use of
Support Vector Machine (SVM) classifiers in our framework.
Abstract. In order to obtain accurate information from Internet web
pages, a suitable representation of this type of document is required. 
In this paper, we present the results of evaluating 7 types of web page 
representations by means of a clustering process. 



1 Web Document Representation 

This work is focused on web page representation by text content. We evaluate 5
representations based solely on the plain text of the web page, and 2 more which,
in addition to plain text, use HTML tags for emphasis and the “title” tag. We
represent web documents using the vector space model. First, we create 5
representations of web documents which use only the plain text of the HTML
documents. These functions are: Binary (B), Term Frequency (TF), Binary Inverse
Document Frequency (B-IDF), TF-IDF, and weighted IDF (WIDF). In addition,
we use 2 more which combine several criteria: word frequency in the text, the
word's appearance in the title, positions throughout the text, and whether or
not the word appears in emphasized tags. These representations are the Analytic
Combination of Criteria (ACC) and the Fuzzy Combination of Criteria (FCC).
The first one [Fresno & Ribeiro 04] uses a linear combination of criteria, whereas
the second one [Ribeiro et al. 03] combines them by using a fuzzy system.

2 Experiments and Conclusions 

We use 3 subsets of the BankSearch Dataset [Sinka & Corne] as the web page
collections to evaluate the representations: (1) ABC&GH is made up of 5
categories belonging to 2 more general themes; (2) G&H groups 2 categories that
belong to a more general theme; and (3) A&D comprises 2 separated categories.
Thus, the difficulty of clustering the collections is not the same. We use 2 feature
reduction methods: (1) considering only the terms that occur more than a
minimum number of times (“Mn”, 5 times); (2) removing all features that appear in
more than X documents (“Mx”, 1000 times). For ACC and FCC we use the proper
weighting function of each one as the reduction function, by selecting the n most
relevant features on each web page (i.e., ACC(4) means that only the 4 most
relevant features of each page are selected). Notice that only B, TF, ACC and
FCC are independent of the collection information. A good representation is one
which leads to a good clustering solution. Since we work with a known, small
number of classes (2 in these collections), we use a partition clustering algorithm
of the CLUTO library [Karypis]. We carry out an external evaluation by means
of F-measure and entropy measures.

* Work supported by the Madrid Research Agency, project 07T/0030/2003 1.

Table 1. Clustering results with the different collections and representations

The results can be seen in Table 1. It shows the number of features, the
values of the external evaluation and the time taken in the clustering process.
The experiments show that no single representation is the best in all cases. ACC
is involved in the best results of 2 collections and the results of FCC are similar
to or, in some cases, better than those of the others. These results suggest that using
light information from the HTML mark-up combined with textual information
leads to good results in clustering web pages. The ACC representation optimizes
the web page’s representation using fewer terms, and does not need collection
information.
Extended Abstract

Searching information resources using mobile devices is affected by displays on 
which only a small fraction of the set of ranked documents can be displayed. In this 
study we explore the effectiveness of relevance feedback methods in assisting the 
user to access a predefined target document through searching on a small display 
device. We propose an innovative approach to study this problem. For small display 
size and, thus, limited decision choices for relevance feedback, we generate and study 
the complete space of user interactions and system responses. This is done by build- 
ing a tree - the documents displayed at any level depend on the choice of relevant 
document made at the earlier level. Construction of the tree of all possible user 
interactions permits an evaluation of relevance feedback algorithms with reduced 
reliance on user studies. From the point of view of real applications, the first few 
iterations are most important - we therefore limit ourselves to a maximum depth of 
six in the tree. 



Fig. 1. Decision tree for iterative relevance feedback, showing nodes in which the target
document is reached, the rank of a document within each display, and the calculation of
RF-rank for the target. This branch is expanded only to depth 5 because the target has
been found

We use the Rocchio relevance feedback scheme in conjunction with the tf-idf
scheme, where documents and queries are represented as vectors of term weights
normalized for length, and similarity is measured by the cosine between
these vectors. We only consider relevant documents, with the Rocchio feedback
weights all being 1. The search task is to find a randomly chosen target in the
database using an initial query of four randomly chosen words from the target. The
evaluation metric is the total number of documents seen before the target is found.
The baseline is the rank of the document after the initial query (R_scroll), before any
relevance feedback is applied. The minimum feedback rank (min R_RF) for a given
target document corresponds to the best-case scenario where the user always provides
the system with the optimal choice of document for relevance feedback, thus
providing an upper bound on the effectiveness of relevance feedback. The number of target
document occurrences in a tree provides a measure of the likelihood of a non-ideal
user locating the target document. At each search iteration, we display K=4
documents to the user. The most obvious strategy is to display the K documents with the
highest rank, which is likely to result in a set of documents all very similar to one
another. An alternative approach is to display a selection of documents such that a
user’s response maximizes the immediate information gain to the system and helps to
minimize the number of search iterations. This is approximated by sampling K
documents from the underlying distribution of similarity. In the experiments we use the
Reuters-21578 collection of textual documents. Using the 19,043 documents that
have non-empty “Title” and “Body” fields, we remove the stop words and create a
vector representation of documents with tf-idf weights. Table 1 contains the statistics
of successful searches, i.e., trees which contain the target. The RF rank of an ideal user
is the minimum path length from the root of the tree to a node with the target,
whereas the mean length of all paths leading to the target represents the average
performance of successful users. For the Top-K scheme, 52 of the 100 trees contained
the target, whereas the corresponding number was 97 for the sampled scheme.
However, 4.49% of paths in successful searches led to the target for Sampled displays as
opposed to 46.67% for the Top-K.
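
One feedback iteration with the greedy (Top-K) display can be sketched as follows. This
is a minimal illustration, assuming sparse vectors stored as dictionaries and Rocchio
weights of 1 for both the query and the single relevant document; the function names and
the toy documents are hypothetical, and the sampled display is not shown.

```python
from math import sqrt

def normalize(vec):
    """Scale a sparse term-weight vector to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    u, v = normalize(u), normalize(v)
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rocchio_update(query, relevant_doc, alpha=1.0, beta=1.0):
    """Rocchio update using one relevant document and no non-relevant ones."""
    terms = set(query) | set(relevant_doc)
    return {t: alpha * query.get(t, 0.0) + beta * relevant_doc.get(t, 0.0)
            for t in terms}

def top_k(query, docs, k=4):
    """Greedy display: the k documents most similar to the query."""
    return sorted(docs, key=lambda name: -cosine(query, docs[name]))[:k]

# Toy collection with made-up term weights.
docs = {
    "d1": {"oil": 1.0, "price": 1.0},
    "d2": {"oil": 1.0, "tanker": 1.0},
    "d3": {"wheat": 1.0},
}
query = {"oil": 1.0}
query = rocchio_update(query, docs["d2"])   # the user marks d2 as relevant
ranking = top_k(query, docs, k=3)
print(ranking)
```

After the update the query contains the terms of the selected document, so documents
similar to it move to the top of the next display.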



Table 1. Performance of Rocchio RF Algorithm based on the Initial Query 
The results indicate that if the user’s query is sufficiently accurate, then the initial
rank of the target document is likely to be high and scrolling or relevance feedback
with a greedy display performs almost equally well. However, if the user’s initial
query is poor, then scrolling is futile and relevance feedback with a display strategy
that maximizes information gain is preferable. Of the two display strategies,
the success of the greedy update relies on a good initial query, whereas the sampled
update provides performance almost independent of the initial query but is very
sensitive to feedback. Future work includes the examination of other display strategies,
including hybrid strategies that attempt to optimally combine the exploratory
properties of maximizing information gain with the exploitative properties of greedy
displays, and the verification of our results with a user trial.
Abstract. We propose and evaluate an approach for automatic information
extraction (IE) by representing the extracted grammatical patterns
as states of a Hidden Markov Model (HMM). Our experiments suggest
that with the incorporation of simple extraction rules, the reliability and
the performance of an HMM-based IE system are greatly enhanced.



1 Introduction 

Recently, several research efforts have been devoted to the automation of IE
from text. While most work falls into the category of extraction pattern learning
methods [1], HMMs have been shown to be a powerful alternative. Nevertheless,
none of the previous HMM-based IE methods fully exploits the wide array
of linguistic information as the extraction pattern learning methods aim to do.
We propose a novel HMM-based IE method in which a document is represented
as a sequence of extracted grammatical patterns instead of a simple sequence of
tokens. For reference, we call our model eHMM.



2 Our Approach 

First, we induce the linguistic features from the candidate instances by
constructing rules out of them. We employ a covering algorithm which is motivated
by Crystal [2]. Based on a similarity measure which is similar in spirit to the
Euclidean $L_1$ norm and a simple induction rule, our algorithm greedily finds and
generalizes rules while eliminating those instances that are already covered from
the search space. The set of linguistic features is produced as the union of these
induced rules. Next, our HMM is trained using this set of linguistic features. The
topology of our model is based on that proposed by [3], which distinguishes
background, prefix, target and suffix states. Since a single, unambiguous path is
possible under this particular topology, the transition probabilities are easily
computed by standard maximum likelihood estimation as ratios of counts. The
emission probabilities, on the other hand, are estimated by weighting each element
of the set of linguistic features according to its similarity to the token being
estimated. Extraction is performed in 2 steps: (1) each word of the document
is mapped to the set of linguistic features, and then (2) the most likely state
sequence is found using the standard Viterbi algorithm. More details can be
found at http://www.cs.toronto.edu/~leehyun/extraction.pdf.
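
The decoding step (2) can be sketched with the standard Viterbi algorithm. The sketch
below is generic: it uses only two states and plain symbols as observations, whereas the
eHMM emits linguistic features and uses background, prefix, target and suffix states. All
probabilities and symbols are made-up values for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            p, arg = max((prev[r][0] * trans_p[r].get(s, 0.0), r) for r in states)
            layer[s] = (p * emit_p[s].get(obs, 0.0), arg)
        best.append(layer)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for layer in reversed(best[1:]):
        state = layer[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model; NAME stands for a name-like feature.
states = ["background", "target"]
start_p = {"background": 0.9, "target": 0.1}
trans_p = {"background": {"background": 0.7, "target": 0.3},
           "target": {"background": 0.4, "target": 0.6}}
emit_p = {"background": {"talk": 0.5, "by": 0.4, "NAME": 0.1},
          "target": {"talk": 0.05, "by": 0.05, "NAME": 0.9}}
path = viterbi(["talk", "by", "NAME"], states, start_p, trans_p, emit_p)
print(path)
```

The observations labelled with the target state form the extracted field. For long
documents, log-probabilities would be preferable to avoid numerical underflow.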



3 Experimental Results 

Our approach was tested on the CMU seminar announcement corpus, which has
been investigated by various researchers. This corpus consists of 485 documents;
the task consists of uniquely identifying the speaker name, starting time, ending
time and location of each seminar.



Similar to other experiments [3, 4] concerning this dataset, we report the
results in Table 1, which are based on a 50/50 split of the corpus and averaged
over five runs. Our system performs comparably to the best system in each
category, while clearly outperforming all other systems in finding the location, whose
extraction is particularly boosted by our approach. Moreover, the eHMM does
not show the same drawback as the traditional HMM method [3], in which
sparsely trained states tend to emit tokens that have never been seen during
training.



Table 1.

System      stime        etime        location     speaker
            F-measure    F-measure    F-measure    F-measure
eHMM        95.9         95.4         88.6         70.3
(LP)²       99.0         95.4         75.0         77.5
HMM         99.1         59.5         83.9         71.1
Rapier      95.9         96.7         73.4         52.9
SNoW-IE     99.6         96.3         75.2         73.8


4 Conclusion 

We have proposed an approach to learning models for IE by combining the 
previous HMM-based IE method with the extraction pattern learning method. 
A natural extension of our work is to find a more complete way of merging HMM 
and extraction pattern learning into one single model. 
Finding term translations as cross-lingual spelling variants on the fly is an important
problem for cross-lingual information retrieval (CLIR). CLIR is typically approached
by automatically translating a query into the target language. For an overview of
cross-lingual information retrieval, see [1]. When automatically translating the query,
specialized terminology is often missing from the translation dictionary. The analysis of query
properties in [2] has shown that proper names and technical terms often are prime keys
in queries, and if not properly translated or transliterated, query performance may
deteriorate significantly. As proper names often need no translation, a trivial solution is to
include the untranslated keys as such into the target language query. However, technical
terms in European languages often have common Greek or Latin roots, which allows for
a more advanced solution using approximate string matching to find the word or words
most similar to the source keys in the index of the target language text database [3].

In European languages the loan words are often borrowed with minor but
language-specific modifications of the spelling. A comparison of methods applied to cross-lingual
spelling variants in CLIR for a number of European languages is provided in [4]. They
compare exact match, simple edit distance, longest common subsequence, digrams,
trigrams and tetragrams as well as skipgrams, i.e. digrams with gaps. Skipgrams perform
best in their comparison, with a relative improvement of 7.5 % on the average over the
simple edit distance baseline. They also show that among the baselines, the simple edit
distance baseline is in general the hardest baseline to beat. They use no explicit n-gram
transformation information. In [5], explicit n-gram transformations are based on
digrams and trigrams. Trigrams are better than digrams, but no comparison is made to
the edit distance baseline. In both of the previous studies on European languages, most
of the distance measures for finding the closest matching transformations are based on a
bag of n-grams, ignoring the order of the n-grams.

Between languages with different writing systems foreign words are often borrowed
based on phonetic rather than orthographic transliterations. In [6], a generative model is
introduced which transliterates words from Japanese to English using weighted
finite-state transducers. The transducer model only uses context-free transliterations which
do not account for the fact that a sound may be spelled differently in different contexts.
This is likely to produce heavily overgenerating systems.

Assume that we have a word in a foreign language. We call this the source word S.
We want to know the possible meanings of the word in a language known to us without
having a translation dictionary. We take the word and compare it to all the words in a
word list L of the target language in order to determine which target word T is most
similar to the unknown word. In the beginning we only compare how many letters or
sounds are similar. As we learn the regularities involved, we observe that the likelihood
for insertion, deletion and replacement for each letter or sound is different in different
contexts. To find the most likely target word for any given source word, we need to
maximize the probability $P(T|S)$, i.e. $\arg\max_{T \in L} P(T|S)$.
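
The search over the word list can be sketched with the simple edit distance, i.e. the
baseline mentioned above in which every insertion, deletion and replacement costs 1
regardless of context. This is not the context-sensitive weighted model proposed in this
work; the function names and the sample words are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Simple (unit-cost) edit distance between strings s and t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # replace a by b, or match
        prev = cur
    return prev[-1]

def closest_target(source, word_list):
    """Return the target word T in the word list L that is closest to source S."""
    return min(word_list, key=lambda t: edit_distance(source, t))

# Illustrative source word and target word list.
best = closest_target("kardiologi", ["cardiology", "carpentry", "radiology"])
print(best)
```

Replacing the unit costs by costs that depend on the neighbouring letters or sounds, and
learning them from term pairs, gives the kind of context-sensitive distance argued for here.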

The first contribution of this work is to show that a distance measure which
explicitly accounts for the order of the letter or sound n-grams significantly outperforms
models based on unordered bags of n-grams. The second contribution is to efficiently
implement an instance of the general edit distance with weighted finite-state
transducers using context-sensitive transliterations. The costs for the edit distance are learned
from a training sample of term pairs. The third contribution of this work is to
demonstrate that the model needs little or no adaptation for covering new language pairs and
that the model is robust, i.e. adding a new language does not adversely affect the
performance of the model for the already trained languages.

Against an index of a large English newspaper database we achieve 80-91 % preci- 
sion at the point of 100 % recall for a set of medical terms in Danish, Dutch, French, 
German, Italian, Portuguese and Spanish. On the average this is a relative improvement 
of 26 % on the simple edit distance baseline. Using the medical terms as training data 
we achieve 64-78 % precision at the point of 100 % recall for a set of terms from varied 
domains in French, German, Italian, Spanish, Swedish and Finnish. On the average this 
is a relative improvement of 23 % on the simple edit distance baseline. For Swedish 
there is no training data and for Finnish, i.e. a language from a different language fam- 
ily, we need only a small amount of training data for adapting the model. In addition, 
the model is reasonably fast. 
Abstract. The suffix tree and the suffix array are fundamental full-text
index data structures and many algorithms have been developed on
them to solve problems occurring in string processing and information
retrieval. Some problems are solved more efficiently using the suffix tree
and others are solved more efficiently using the suffix array. We consider
the index data structure with the capabilities of both the suffix tree
and the suffix array without requiring much space. For alphabets
whose size is negligible, Abouelhoda et al. developed the enhanced suffix
array for this purpose. It consists of the suffix array and the child table.
The child table stores the parent-child relationship between the nodes in
the suffix tree so that every algorithm developed on the suffix tree can
be run with a small and systematic modification. Since the child table
consumes moderate space and is constructed very fast, the enhanced
suffix array is almost as time/space-efficient as the suffix array. However,
when the size of the alphabet is not negligible, the enhanced suffix array
loses the capabilities of the suffix tree. The pattern search in the enhanced
suffix array takes $O(m|\Sigma|)$ time, where $m$ is the length of the pattern
and $\Sigma$ is the alphabet, while the pattern search in the suffix tree takes
$O(m \log |\Sigma|)$ time.

In this paper, we improve the enhanced suffix array to have the capabilities
of the suffix tree and the suffix array even when the size of the
alphabet is not negligible. We do this by presenting a new child table,
which improves the enhanced suffix array to support the pattern search
in $O(m \log |\Sigma|)$ time. Our index data structure is almost as
time/space-efficient as the enhanced suffix array. It consumes the same space as the
enhanced suffix array and its construction time is slightly slower (< 4%)
than that of the enhanced suffix array. From a different point of view, it
can be considered the first practical one facilitating the capabilities of
suffix trees when the size of the alphabet is not negligible, because the
suffix tree supporting $O(m \log |\Sigma|)$-time pattern search is not easy to
implement and thus it is rarely used in practice.

* This research was supported by the Program for the Training of Graduate Students 
in Regional Innovation which was conducted by the Ministry of Commerce, Industry 
and Energy of the Korean Government. 
1 Introduction

The full-text index data structure for a text incorporates the indices for all
the suffixes of the text. It is used in numerous applications [7], such as exact
string matching, computing matching statistics, finding maximal repeats, and
finding longest common substrings. Two fundamental full-text index data
structures are the suffix tree and the suffix array.

The suffix tree of text T due to McCreight [15] is a compacted trie of all
suffixes of T. It was designed as a simplified version of Weiner’s position tree [17].
If the size of the alphabet is negligible, the suffix tree for text T of length
$n$ consumes $O(n)$ space and can be constructed in $O(n)$ time [4,5,15,16]. In
addition, a pattern P of length $m$ can be found in $O(m)$ time in the suffix tree.
If the size of the alphabet is not negligible, the size of the suffix tree and the
search time in the suffix tree are affected by the data structure for node branching.
There are three different types of such data structures: the array, the
linked list, and the balanced search tree. If the data structure is an array, the
size of the suffix tree is $O(n|\Sigma|)$, where $\Sigma$ is the alphabet, and searching
the pattern P takes $O(m)$ time. If it is a linked list, the size of the suffix tree
is $O(n)$ and searching the pattern P takes $O(m|\Sigma|)$ time. If the data structure
is a balanced search tree, the size of the suffix tree is $O(n)$ and searching the
pattern P takes $O(m \log |\Sigma|)$ time. Thus, when the size of the alphabet is not
negligible, only the balanced search tree is an appropriate data structure for node
branching. However, using the balanced search tree as the data structure for node
branching makes the suffix tree hard to implement and it also contributes a quite
large hidden constant to the space complexity of the suffix tree. Thus, the suffix
tree supporting $O(m \log |\Sigma|)$-time pattern search is rarely used in practice.

The suffix array due to Manber and Myers [14] and independently due to
Gonnet et al. [6] is basically a sorted list of all the suffixes of the string. The
suffix array was developed as a space-efficient alternative to the suffix tree. It
consumes only O(n) space even when the size of the alphabet is not negligible.
Since it was developed as a space-efficient full-text index data structure, it was not
as time-efficient as the suffix tree when it was introduced. It took O(n log n) time
to construct the suffix array¹ and O(m + log n) time for pattern search even
with the lcp (longest common prefix) information. However, researchers have
tried to make the suffix array as time-efficient as the suffix tree. Recently, almost
at the same time, three different algorithms have been developed to directly
construct the suffix array in O(n) time by Kim et al. [12], Ko and Aluru [13], and
Kärkkäinen and Sanders [10]. In addition, practically fast algorithms for suffix
array construction have been developed by Larsson and Sadakane [8], and Kim,
Jo, and Park [11].
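To make the "sorted list of all the suffixes" concrete, the following is a minimal sketch in Python. It is not any of the construction algorithms cited above: it sorts the suffixes naively and searches with two plain binary searches, which costs O(m log n) per query rather than the O(m + log n) obtained with lcp information. The function names are illustrative assumptions, and positions are 0-based here.

```python
def build_suffix_array(text):
    """Return the 0-based starting positions of all suffixes in sorted order.

    Naive construction (sorting whole suffixes); only meant as an illustration.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose first m symbols are >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # leftmost suffix whose first m symbols are > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "accttacgacgaccttcca"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "acg"))
```

The two searches work because truncating every suffix to its first m symbols preserves the sorted order, so all suffixes that start with the pattern occupy one contiguous range of the array.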

Although the suffix array is becoming more time-efficient, the suffix tree
still has merits because some problems can be solved in a simple and efficient
manner using the suffix tree. Thus, there has been an effort to develop a full-text
index data structure that has the capabilities of the suffix tree and the suffix
array without requiring much space. When the size of the alphabet is negligible,
Abouelhoda et al. [1,2] developed the enhanced suffix array for this purpose. It
consists of the suffix array and the child table. The child table stores the
parent-child relationship between the nodes in the suffix tree whose data structure for
node branching is the linked list. On the enhanced suffix array, every algorithm
developed on the suffix tree can be run with a small and systematic modification.
Since the child table is an array of n elements and it is constructed very fast,
the enhanced suffix array is still space-efficient. However, when the size of the
alphabet is not negligible, the enhanced suffix array loses the power of the suffix
tree. The pattern search in the enhanced suffix array takes O(m|Σ|) time, while
the pattern search in the suffix tree takes O(m log |Σ|) time. This is because the
child table stores the information about the suffix tree whose data structure for
node branching is the linked list.

¹ The suffix array could be constructed in O(n) time if we first constructed the suffix
tree and then the suffix array from the suffix tree. However, constructing the suffix
array in this way is not space-efficient.

In this paper, we present an efficient index data structure having the
capabilities of the suffix tree and the suffix array even when the size of the alphabet
is not negligible. We do this by presenting a new child table storing the
parent-child relationship between the nodes in the suffix tree whose data structure for
node branching is the complete binary tree. With this new child table, one can
search the pattern P in O(m log |Σ|) time. Our index data structure is almost
as time/space-efficient as the enhanced suffix array. It consumes the same space
as the enhanced suffix array and its construction time is slightly slower (< 3%)
than that of the enhanced suffix array. In addition, since the construction time
of the enhanced suffix array is also only slightly slower than that of the suffix array,
our index data structure can be constructed almost as fast as the suffix array.
From a different point of view, it can be considered the first practical index
offering the capabilities of suffix trees when the size of the alphabet is not negligible,
because the suffix tree supporting O(m log |Σ|)-time pattern search is not easy
to implement and thus is rarely used in practice.

We describe the main difficulties in making our index data structure almost as
time/space-efficient as the enhanced suffix array, and the techniques to overcome
them.

• Our child table is an incorporation of four conceptual arrays up, down,
lchild, and rchild, while the previous child table is that of three conceptual
arrays up, down, and nextlIndex: We developed a new incorporation
technique that can store the four conceptual arrays in an array of n elements,
which is the same space as the previous child table consumes.

• The structural information of the complete binary tree is not easily obtained
directly from the lcp table: We developed an lcp extension technique that
extends the lcp to reflect the structure of the complete binary tree. This lcp
extension technique requires a right-to-left scan of the bit representations
of n integers to find the rightmost 1 in each bit representation. At first glance
this seems to take O(n log n) time; however, an amortized analysis shows that
it takes O(n) time.

• During the pattern search, we have to distinguish the elements of the child table
that come from up or down from those that come from lchild or rchild. However,
the child table does not indicate this. To solve this problem, we use the lcp array. If
lcp[i] = lcp[cldtab[i]], cldtab[i] is from lchild or rchild. Otherwise,
cldtab[i] is from up or down.

Fig. 1. The enhanced suffix array and the suffix tree of accttacgacgaccttcca#. The
suffix tree uses the linked list for node branching.

We introduce some notations and definitions in Section 2. In Section 3, we 
introduce the enhanced suffix array. In Section 4, we describe our index data 
structure and an algorithm to generate it. In Section 5, we measure the perfor- 
mance of our index data structure by experiments and compare it with that of 
the enhanced suffix array. We conclude in Section 6. 

2 Preliminaries 

Consider a string S of length n over an alphabet Σ. Let S[i] for 1 ≤ i ≤ n denote
the ith symbol of string S. We assume that S[n] is a special symbol # which
is lexicographically larger than any other symbol in Σ. The suffix array of S
consists of the pos and lcp arrays. The array pos[1..n] is basically a sorted list
of all the suffixes of S. However, the suffixes themselves are too large to be stored,
and thus only the starting positions of the suffixes are stored in the pos array.
Figure 1 shows an example of the pos array of accttacgacgaccttcca#. In this
paper, we identify a suffix with its starting position.

The array lcp[1..n] stores the lengths of the longest common prefixes of
adjacent suffixes in the pos array. We store in lcp[i], 2 ≤ i ≤ n,
the length of the longest common prefix of pos[i − 1] and pos[i]. We store 0
in lcp[1]. For example, in Fig. 1, lcp[3] = 2 because the length of the longest
common prefix of pos[2] and pos[3] is 2.
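The two arrays can be reproduced for the running example with a short sketch. This is a naive quadratic construction written only for illustration, not one of the algorithms used in the experiments. Two choices are assumptions of the sketch: the terminal symbol # is replaced by '~' so that it sorts after the lowercase letters, and index 0 of each array is an unused placeholder so that the indices agree with the 1-based notation of the text.

```python
def suffix_array_with_lcp(s):
    """Return 1-based arrays (pos, lcp) for a string s that ends in '#'."""
    t = s.replace("#", "~")  # assumption: '~' sorts after the letters used here
    n = len(t)
    # pos[1..n]: starting positions (1-based) of the suffixes in sorted order
    pos = [0] + sorted(range(1, n + 1), key=lambda i: t[i - 1:])
    lcp = [0] * (n + 1)  # lcp[1] = 0 by definition
    for i in range(2, n + 1):
        a, b = t[pos[i - 1] - 1:], t[pos[i] - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return pos, lcp


pos, lcp = suffix_array_with_lcp("accttacgacgaccttcca#")
print(pos[1:6])  # the five suffixes that start with 'a'
print(lcp[1:6])
```

For this string the suffixes at positions 12 and 9 (accttcca# and acgaccttcca#) are adjacent in sorted order and share the prefix ac, which is the value lcp[3] = 2 quoted above.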

The lcp-interval of the suffix array of S, corresponding to a node in the
suffix tree of S, is defined as follows [1,2].

Definition 1. Interval [i..j], 0 < i ≤ j ≤ n, is an lcp-interval of lcp-value l
(l-interval), if

1. lcp[i] < l,

2. lcp[k] ≥ l for all i + 1 ≤ k ≤ j,

3. lcp[k] = l for some i + 1 ≤ k ≤ j if i ≠ j, and l = n − i + 1 if i = j, and

4. lcp[j + 1] < l.

For example, in Fig. 1, interval [1..4] is a 2-interval because lcp[1] < 2, lcp[k] ≥ 2
for all 2 ≤ k ≤ 4, lcp[3] = 2, and lcp[5] < 2. The prefix of an lcp-interval [i..j]
is the longest common prefix of the suffixes in pos[i..j]. Fig. 1 shows the
one-to-one correspondence between the lcp-intervals in the suffix array and the nodes
in the suffix tree. The parent-child relationship between the lcp-intervals is the
same as that between the corresponding nodes in the suffix tree. That is, an
lcp-interval [i..j] is a child interval of another lcp-interval [k..l] if the corresponding
node of [i..j] is a child of the corresponding node of [k..l]. An lcp-interval [i..j]
is the parent interval of an lcp-interval [k..l] if [k..l] is a child interval of [i..j].
For example, in Fig. 1, [1..4] is a child interval of [1..5] and [1..5] is the parent
interval of [1..4].
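Definition 1 can be checked mechanically for non-singleton intervals, as in the following sketch. The helper names are illustrative, the lcp array is rebuilt naively so that the snippet is self-contained, and two conventions are assumptions of the sketch rather than of the paper: '#' is mapped to '~' so that it sorts last, and lcp[n + 1] is treated as −1.

```python
def lcp_array(s):
    """1-based lcp array of s (which ends in a unique '#'); lcp[0] is a -1 placeholder."""
    t = s.replace("#", "~")  # assumption: '~' sorts after every other symbol
    order = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = [-1, 0]
    for a, b in zip(order, order[1:]):
        k = 0
        while t[a + k] == t[b + k]:  # stops at the unique terminal symbol at the latest
            k += 1
        lcp.append(k)
    return lcp


def is_lcp_interval(lcp, i, j, l):
    """Check conditions 1-4 of Definition 1 for a non-singleton interval [i..j]."""
    n = len(lcp) - 1
    after = lcp[j + 1] if j + 1 <= n else -1  # assumption: value beyond the array
    inner = lcp[i + 1:j + 1]
    return lcp[i] < l and min(inner) >= l and l in inner and after < l


lcp = lcp_array("accttacgacgaccttcca#")
print(is_lcp_interval(lcp, 1, 4, 2))  # the 2-interval from the example
print(is_lcp_interval(lcp, 1, 4, 1))  # fails condition 4, since lcp[5] = 1
```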

3 The Enhanced Suffix Array

The enhanced suffix array due to Abouelhoda et al. [1,2] consists of the suffix
array and the child table. The child table cldtab is an incorporation of three
conceptual arrays up, down, and nextlIndex. They store the information about
the structure of the suffix tree whose data structure for node branching is the
linked list. This suffix tree differs slightly from the traditional suffix tree in
that the linked list does not include the first child interval. A tree edge connects
an lcp-interval to its second child interval and a link in the list connects an ith,
i ≥ 2, child interval to its next sibling, i.e., the (i + 1)st child interval. The
information about the tree edges is stored by the arrays up and down and the
information about the links in the linked list is stored by the array nextlIndex.
The meanings of the formal definitions² of the arrays up, down, and nextlIndex are as
follows.

- The element up[i] stores the first index of the second child interval of the
longest lcp-interval ending at index i − 1.

- The element down[i] stores the first index of the second child interval of the
longest lcp-interval starting at index i.

- The element nextlIndex[i] stores the first index of the next sibling interval
of the longest lcp-interval starting at index i if and only if the interval is
neither the first child nor the last child of its parent.

In Fig. 1, up[14] stores 7, which is the first index of [7..9], the second child
interval of [6..13], which is the longest lcp-interval ending at index 13. Element
down[7] stores 8, which is the first index of [8..9], the second child interval of
[7..9], which is the longest lcp-interval starting at index 7. Element nextlIndex[7]
stores 10, which is the first index of [10..11], the next sibling interval of
[7..9], which is the longest lcp-interval starting at index 7.

We first show how to find the first child interval of an lcp-interval and then
the other child intervals. To find the first child interval of a given lcp-interval
[i..j], we compute the first index a of the second child interval of [i..j]. (Once a
is computed, one can find the first child interval [i..a − 1] easily.) The value a is
stored in up[j + 1] or down[i]. It is stored in up[j + 1] if [i..j] is not the last child
interval of its parent and in down[i] otherwise. If [i..j] is not the last child interval
of its parent, [i..j] is the longest interval ending at index j and thus up[j + 1] stores
a. Otherwise, [i..j] is shorter than its parent interval [k..j], k < i, and thus [i..j]
is not the longest interval ending at j. In this case, however, [i..j] is the longest
interval starting at i and thus down[i] stores a.

We show how to find the kth, k ≥ 2, child interval of [i..j]. We first define
nextlIndex^k[a] recursively as nextlIndex[nextlIndex^(k−1)[a]], with
nextlIndex^0[a] = a. Then, the first index of the kth child interval is
nextlIndex^(k−2)[a], where a is the first index of the second child interval.
The last index of the kth child interval is nextlIndex^(k−1)[a] − 1 if it is not
the last child, and j otherwise.

Abouelhoda et al. [1,2] showed that only n of the 3n elements of the arrays
up, down, and nextlIndex are necessary, which gives the following lemma.

Lemma 1. Only n elements among the 3n elements of arrays up, down, and
nextlIndex are necessary and they can be stored in the child table of n
elements [1, 2].

The procedures UP-DOWN and NEXT in Fig. 2 compute the arrays up and
down, and nextlIndex, respectively. Each procedure runs in O(n) time. The
analysis of the running time of these procedures, together with the proof of their
correctness, is given in [1].

² up[i] = min{q ∈ [0..i − 1] | lcp[q] > lcp[i] and ∀k ∈ [q + 1..i − 1] : lcp[k] ≥ lcp[q]}.
down[i] = max{q ∈ [i + 1..n] | lcp[q] > lcp[i] and ∀k ∈ [i + 1..q − 1] : lcp[k] > lcp[q]}.
nextlIndex[i] = min{q ∈ [i + 1..n] | lcp[q] = lcp[i] and ∀k ∈ [i + 1..q − 1] : lcp[k] >
lcp[i]}.
Procedure UP-DOWN
    lastindex := −1;
    push(0);
    for i := 1 to n do
        while lcp[i] < lcp[top] do
            lastindex := pop;
            if lcp[i] ≤ lcp[top] and lcp[top] ≠ lcp[lastindex] then
                down[top] := lastindex;
        if lastindex ≠ −1 then
            up[i] := lastindex;
            lastindex := −1;
        push(i)
end

Procedure NEXT
    push(0);
    for i := 1 to n do
        while lcp[i] < lcp[top] do
            pop;
        if lcp[i] = lcp[top] then
            lastindex := pop;
            nextlIndex[lastindex] := i;
        push(i)
end

Fig. 2. Procedures UP-DOWN and NEXT.
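The two procedures and the child-interval traversal of this section can be tried out on the running example with the following sketch. It keeps up, down, and nextlIndex as three separate arrays instead of packing them into one child table, and it handles only non-root, non-singleton intervals. Several details are assumptions of the sketch: the lcp array is built naively, '#' is mapped to '~' so that it sorts last, lcp[0] = −1 acts as a sentinel that is never popped from the stack, and the test i < up[j + 1] ≤ j is used to decide whether up[j + 1] belongs to the interval [i..j].

```python
def lcp_array(s):
    """1-based lcp array of s (which ends in a unique '#'); lcp[0] = -1 is a sentinel."""
    t = s.replace("#", "~")  # assumption: '~' sorts after every other symbol
    order = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = [-1, 0]
    for a, b in zip(order, order[1:]):
        k = 0
        while t[a + k] == t[b + k]:
            k += 1
        lcp.append(k)
    return lcp


def up_down(lcp):
    """Procedure UP-DOWN; entries that are never assigned stay None."""
    n = len(lcp) - 1
    up, down = [None] * (n + 2), [None] * (n + 2)
    stack, last = [0], -1
    for i in range(1, n + 1):
        while lcp[i] < lcp[stack[-1]]:
            last = stack.pop()
            top = stack[-1]
            if lcp[i] <= lcp[top] and lcp[top] != lcp[last]:
                down[top] = last
        if last != -1:
            up[i] = last
            last = -1
        stack.append(i)
    return up, down


def next_l_index(lcp):
    """Procedure NEXT; entries that are never assigned stay None."""
    n = len(lcp) - 1
    nxt = [None] * (n + 2)
    stack = [0]
    for i in range(1, n + 1):
        while lcp[i] < lcp[stack[-1]]:
            stack.pop()
        if lcp[i] == lcp[stack[-1]]:
            nxt[stack.pop()] = i
        stack.append(i)
    return nxt


def child_intervals(i, j, up, down, nxt):
    """Child intervals of a non-root, non-singleton lcp-interval [i..j]."""
    a = up[j + 1] if up[j + 1] is not None and i < up[j + 1] <= j else down[i]
    children = [(i, a - 1)]          # first child
    while nxt[a] is not None:        # follow the sibling links
        children.append((a, nxt[a] - 1))
        a = nxt[a]
    children.append((a, j))          # last child
    return children


lcp = lcp_array("accttacgacgaccttcca#")
up, down = up_down(lcp)
nxt = next_l_index(lcp)
print(up[14], down[7], nxt[7])
print(child_intervals(6, 13, up, down, nxt))
```

Each step of the sibling walk costs constant time, so enumerating the children of an interval is linear in the number of children, which is where the |Σ| factor of the search time comes from.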

4 The New Child Table 

Our index data structure consists of the suffix array and a new child table. The
new child table cldtab stores the information about the suffix tree whose data
structure for node branching is the complete binary tree. Figure 3 shows a suffix
tree for accttacgacgaccttcca# whose data structure for node branching is the
complete binary tree. In this suffix tree, the child intervals except the first child
interval of an lcp-interval form a complete binary tree. Let the root child of
[i..j] denote the root lcp-interval of the complete binary tree. In Fig. 3, each solid
line is an edge connecting an lcp-interval to its root child interval and the dashed
lines are the edges connecting the sibling intervals such that they form complete
binary trees. Our child table is an incorporation of four conceptual arrays up,
down, lchild, and rchild. The arrays up and down store the information about
the solid edges and the arrays lchild and rchild store the information about
the dashed edges.

We describe the definitions of the arrays up, down, lchild, and rchild. The
element up[i] stores the first index of the root child of the longest interval ending
at index i − 1 and the element down[i] stores the first index of the root child of
the longest interval starting at i. The element lchild[i] (resp. rchild[i]) stores
the first index of the left (resp. right) child of the longest interval starting at i
in the complete binary tree, which is a sibling in the suffix tree.

We show that only n of the 4n elements of the arrays up, down,
lchild, and rchild are necessary and that they can be stored in the child table of
n elements. We have only to show that the number of solid and dashed edges
is n. We first count the number of outgoing edges from all the children of an
lcp-interval in the following lemma.

Fig. 3. Our index data structure and the suffix tree of accttacgacgaccttcca#. The suffix
tree uses the complete binary tree for node branching.

Lemma 2. The number of outgoing edges from the children of a non-singleton
lcp-interval x is 2q_x − q'_x − 2, where q_x is the number of children of x and q'_x is
the number of singleton children of x.

Proof. The number of outgoing edges from the children of x is equal to C + R,
where C is the number of dashed edges in the complete binary tree for the
children of x and R is the number of solid edges to the root children of the
children of x. Since C is q_x − 2 (because there are q_x − 1 children of x in the
complete binary tree and a complete binary tree with q_x − 1 nodes has q_x − 2
edges) and R is q_x − q'_x (because singleton children have no root children), C + R
becomes 2q_x − q'_x − 2.

From Lemma 2, we can derive the following theorem. 
Theorem 1. Only n elements among the 4n elements of arrays up, down, lchild,
and rchild are necessary and they can be stored in the child table of n elements.

Proof. The main part of this proof is to show that the total number of outgoing
edges from every non-singleton interval is n. The details are omitted.

We show how to compute the child table. The main idea is that the arrays lchild
and rchild are not different from the arrays up and down if we do not
differentiate the edges of the suffix tree from those of the complete binary trees. In
order not to differentiate those edges, we use the lcp extension technique. We use a
temporary array depth to store the extended part of the lcp. For ease of explanation,
we use a conceptual array hgt where hgt[i] is the concatenation of lcp[i] and depth[i].
The computation of the arrays consists of the following three steps.

1. Compute the number of children for every lcp-interval: We can do this in
O(n) time by running the procedure NEXT in Fig. 2.

2. For each child interval except the first child interval of lcp-interval [i..j],
compute its depth in the complete binary tree: Let [c_k..c_{k+1} − 1], k ≥ 2,
denote the kth child interval. We compute the depth of the interval [c_k..c_{k+1} − 1]
and store it in depth[c_k]. We only describe how to compute the depth when
all leaves are at the same level. (Computation of the depth is slightly
different when the leaves are not all at the same level.) We compute the depth D of
the complete binary tree, and the level L_k of the kth child in the complete
binary tree. Once D and L_k are computed, the depth of the kth child is easily
computed because it is D − L_k. Since computing D is straightforward, we
only describe how to compute L_k. The level L_k corresponds to the number of
trailing 0's in the bit representation of k. For example, every odd-numbered
child has no trailing 0's and thus its level is 0. We consider the running
time of computing the depths of all q children in the complete binary tree. To
determine the depths of all q nodes, we have to scan the bit representation
from the right until we reach the rightmost 1 for all integers 1, 2, ..., q. One
can show that this takes O(q) time overall by an amortized analysis
which is very similar to the one used to count the bit-flip operations of a
binary counter [3]. Overall, this step takes O(n) time.

3. Compute the arrays up' and down' with the hgt and store them in the child
table: We do this in O(n) time by running the procedure UP-DOWN in Fig. 2,
and storing up'[i] in cldtab[i − 1] and down'[i] in cldtab[i]. The elements of
arrays up' and down' correspond to the elements of arrays up, down, lchild,
and rchild computed from the lcp. If the longest interval starting at index
i is an internal node, up'[i] = lchild[i] and down'[i] = rchild[i]. If it is a
leaf, down'[i] = down[i] or up'[j + 1] = up[j + 1].
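The bit-scanning argument of step 2 can be illustrated with a small sketch. The numbering used here is an assumption made for the illustration: the nodes of a complete binary tree with q = 2^D − 1 nodes are numbered 1..q in sorted (in-order) order, so that the level of node k equals the number of trailing 0's of k, the leaves are the odd numbers, and the root is (q + 1)/2. How the paper maps child intervals to these numbers is not spelled out in this section, so the code should be read as a sketch of the idea rather than as the authors' procedure.

```python
def level(k):
    """Number of trailing 0 bits of k >= 1, found by a right-to-left bit scan."""
    zeros = 0
    while (k & 1) == 0:
        k >>= 1
        zeros += 1
    return zeros


def bits_scanned(q):
    """Total number of bits inspected when computing level(k) for k = 1..q."""
    return sum(level(k) + 1 for k in range(1, q + 1))


def search_path(q, target):
    """Nodes visited from the root down to `target`; q must be 2**D - 1."""
    node = (q + 1) // 2
    path = [node]
    while node != target:
        step = 1 << (level(node) - 1)  # distance to either child of `node`
        node = node - step if target < node else node + step
        path.append(node)
    return path


print([level(k) for k in range(1, 8)])
print(bits_scanned(1000))
print(search_path(7, 5))
```

Half of the integers 1..q are odd and stop the scan after one bit, a quarter stop after two bits, and so on, which is why the total stays below 2q. The descent in search_path visits at most D nodes, matching the log |Σ| factor in the search time when a node has up to |Σ| children.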

Theorem 2. The new child table cldtab can be constructed in O(n) time.

We consider the pattern search in our index data structure. The pattern
search starts at the root child [i..j] of [1..n]. If a prefix of the pattern matches
the prefix of [i..j], we move to the root child of [i..j] using up[j + 1] or
down[i]. Otherwise (if a mismatch occurs), we move to one of the siblings using
lchild[i] or rchild[i]. In this way, we proceed with the pattern search until we
find the pattern or we are certain that the pattern does not exist. However,
the child table does not indicate whether an element of the child table is from
up or down, or from lchild or rchild. To solve this problem, we use the lcp
array. If lcp[i] = lcp[cldtab[i]], cldtab[i] is from lchild or rchild. Otherwise,
cldtab[i] is from up or down. Thus, by exploiting both arrays cldtab and lcp,
we can search a pattern in O(m log |Σ|) time.

Theorem 3. The new child table cldtab and the lcp array support
O(m log |Σ|)-time pattern search in the worst case.

Fig. 4. The experimental results for the pattern search in the enhanced suffix array
and ours. We measured the running time for performing the pattern search for 10®
number of patterns of lengths between 300 and 400.



5 Experimental Results 

We measure the search time in our index data structure and that in the enhanced
suffix array. In addition, we also measure the construction time of the suffix array,
the enhanced suffix array, and our index data structure. We generated different
kinds of random strings which differ in length (1M, 10M, 30M, and 50M)
and in the size of the alphabet (2, 4, 20, 64, and 128) from which they are drawn.
We measured the running time in seconds on a 2.8 GHz Pentium IV with 2 GB of
main memory.

Figure 4 compares the pattern search time in the enhanced suffix array with 
that in our index data structure. It shows that the pattern search in our index 
data structure is faster when the size of alphabet is larger than 20 regardless 
of the length of the random string. Moreover, the ratio of the pattern search 
time of the enhanced suffix array to that of ours becomes larger as the size of 
alphabet becomes large. These experimental results are consistent with the time 
complexity analysis of the pattern search. 

Figure 5 compares the construction time of the suffix array, the enhanced
suffix array, and our index data structure. The construction time of our index
data structure is at most 4% slower than that of the enhanced suffix array. In
addition, the construction time for the child table is almost negligible compared
with the construction time for the pos and lcp arrays. Thus, we can conclude
that our data structure can be constructed almost as fast as the suffix array and
the enhanced suffix array.

Fig. 5. We computed the percentage of the construction time of our index data
structure over that of the enhanced suffix array. The construction time for the enhanced
suffix array (ESA) is the construction time for the arrays pos, lcp, and cldtab. The
construction time for our data structure is the construction time for the arrays pos, lcp,
and new cldtab. To construct the pos array, we considered Larsson and Sadakane's [8]
(LS) algorithm, Kärkkäinen and Sanders' [10] (KS) algorithm, and Kim, Jo, and
Park's [11] (KJP) algorithm. Since the KJP algorithm is the fastest of these in
most cases, we used it to construct the pos array. To construct the lcp
array, we used the algorithm of Kasai et al. [9].

6 Conclusion 

We presented an index data structure with the capabilities of the suffix tree and
the suffix array even when the size of the alphabet is not negligible by improving
the enhanced suffix array. Our index data structure supports pattern search
in O(m log |Σ|) time and it is almost as time/space-efficient as the enhanced
suffix array. From a different point of view, it can be considered the first practical
index offering the capabilities of suffix trees when the size of the alphabet is not
negligible, because the suffix tree supporting O(m log |Σ|)-time pattern search is
not easy to implement and thus is rarely used in practice.
Abstract. We show that, by combining an existing compression boosting
technique with the wavelet tree data structure, we are able to design
a variant of the FM-index which scales well with the size of the input
alphabet Σ. The size of the new index built on a string T[1, n] is bounded
by nH_k(T) + O((n log log n)/log_|Σ| n) bits, where H_k(T) is the k-th order
empirical entropy of T.

The above bound holds simultaneously for all k ≤ α log_|Σ| n and 0 <
α < 1. Moreover, the index design does not depend on the parameter k,
which plays a role only in the analysis of the space occupancy.

Using our index, the counting of the occurrences of an arbitrary pattern
P[1, p] as a substring of T takes O(p log |Σ|) time. Locating each
pattern occurrence takes O(log |Σ| (log² n / log log n)) time. Reporting a
text substring of length ℓ takes O((ℓ + log² n / log log n) log |Σ|) time.



1 Introduction 

A full-text index is a data structure built over a text string T[l, n] that supports 
the efficient search for an arbitrary pattern as a substring of the indexed text. 
A self-index is a full-text index that encapsulates the indexed text T, without 
hence requiring its explicit storage. 

The FM-index [3] has been the first self-index in the literature to achieve
a space occupancy close to the k-th order entropy of T, hereafter denoted by
H_k(T) (see Section 2.1). Precisely, the FM-index occupies at most 5nH_k(T) +
o(n) bits of storage, and allows the search for the occ occurrences of a pattern
P[1, p] within T in O(p + occ log^(1+ε) n) time, where ε > 0 is an arbitrary constant
fixed in advance. It can display any text substring of length ℓ in O(ℓ + log^(1+ε) n)
time. The design of the FM-index is based upon the relationship between the
Burrows-Wheeler compression algorithm [1] and the suffix array data
structure [16,9]. It is therefore a sort of compressed suffix array that takes advantage
of the compressibility of the indexed text in order to achieve space occupancy
close to the Information Theoretic minimum. Indeed, the design of the FM-index
does not depend on the parameter k and its space bound holds simultaneously
over all k ≥ 0. These remarkable theoretical properties have been validated by
experimental results [4,5] and applications [14,21].

* Partially supported by the Italian MIUR projects ALINWEB and ECD and Grid.it and
“Piattaforma distribuita ad alte prestazioni”, and by the Chilean Fondecyt
Grant 1-020831.

The above bounds on the FM-index space occupancy and query time have
been obtained assuming that the size of the input alphabet is a constant. Hidden
in the big-O notation there is an exponential dependency on the alphabet size
in the space bound, and a linear dependency on the alphabet size in the time
bounds. More specifically, the search time is O(p + occ |Σ| log^(1+ε) n) and the
time to display a text substring is O((ℓ + log^(1+ε) n) |Σ|). Although in practical
implementations of the FM-index [4, 5] these dependencies are removed with
only a small penalty in the query time, it is worthwhile to investigate whether
it is possible to build a more “alphabet-friendly” FM-index.

In this paper we use the compression boosting technique [2, 7] and the wavelet 
tree data structure [11] to design a version of the FM-index which scales well 
with the size of the alphabet. Compression boosting partitions the Burrows- 
Wheeler transformed text into contiguous areas in order to maximize the overall 
compression achievable with zero-order compressors used over each area. The 
wavelet tree offers a zero-order compression and also permits answering some 
simple queries over the compressed area. 

The resulting data structure indexes a string T[1, n] drawn from an
alphabet Σ using nH_k(T) + O((n log log n)/log_|Σ| n) bits of storage. The above
bound holds simultaneously for all k ≤ α log_|Σ| n and 0 < α < 1. The
structure of our index is extremely simple and does not depend on the
parameter k, which plays a role only in the analysis of the space occupancy. With
our index, the counting of the occurrences of an arbitrary pattern P[1, p] as a
substring of T takes O(p log |Σ|) time. Locating each pattern occurrence takes
O(log |Σ| (log² n / log log n)) time. Displaying a text substring of length ℓ takes
O((ℓ + log² n / log log n) log |Σ|) time. Compared to the original FM-index, we
note that the new version scales better with the alphabet size in all aspects.
Although the time to count pattern occurrences has increased, that of locating
occurrences and displaying text substrings has decreased.

Recently, various compressed full-text indexes have been proposed in the
literature achieving several time/space trade-offs [13,20,18,11,12,10]. Among
them, the one with the smallest space occupancy is the data structure described
in [11] (Theorems 4.2 and 5.2) that achieves O(p log |Σ| + polylog(n)) time to
count the pattern occurrences, O(log |Σ| (ℓ + log² n / log log n)) time to locate
and display a substring of length ℓ, and uses nH_k(T) + O((n log log n)/log_|Σ| n)
bits of storage. The space bound holds for a fixed k which must be chosen in
advance, i.e., when the index is built. The parameter k must satisfy the constraint
k ≤ α log_|Σ| n with 0 < α < 1, which is the same limitation that we have for our
space bound. An alternative way to reduce the alphabet dependence of the
FM-index has been proposed in [10], where the resulting space bound is the higher
O((H_0 + 1)n), although it is based on a solution that is simpler to implement.

To summarize, our data structure is extremely simple, has the smallest
known space occupancy, and counts the occurrences faster than the data
structure in [11], which is the only other compressed index known to date with an
nH_k(T) + o(n) space occupancy.

2 Background and Notation 

Hereafter we assume that T[1, n] is the text we wish to index, compress and
query. T is drawn from an alphabet Σ of size |Σ|. By T[i] we denote the i-th
character of T, T[i, n] denotes the i-th text suffix, and T[1, i] denotes the i-th text
prefix. We write |w| to denote the length of string w.

2.1 The k-th Order Empirical Entropy

Following a well established practice in Information Theory, we lower bound the 
space needed to store a string T by using the notion of empirical entropy. The 
empirical entropy is similar to the entropy defined in the probabilistic setting 
with the difference that it is defined in terms of the character frequencies ob- 
served in T rather than in terms of character probabilities. The key property 
of empirical entropy is that it is defined pointwise for any string T and can 
be used to measure the performance of compression algorithms as a function 
of the string structure, thus without any assumption on the input source. In a 
sense, compression bounds produced in terms of empirical entropy are worst-case 
measures. 

Formally, the zero-th order empirical entropy of T is defined as

\[ H_0(T) = -\sum_{i=1}^{|\Sigma|} \frac{n_i}{n} \log \frac{n_i}{n}, \]

where n_i is the number of occurrences of the i-th alphabet character in T,
n = |T|, and all logarithms are taken to the base 2 (with
0 log 0 = 0). To introduce the concept of k-th order empirical entropy we need
to define what a context is. A length-k context w in T is one of its substrings
of length k. Given w, we denote by w_T the string formed by concatenating all
the symbols following the occurrences of w in T, taken from left to right. For
example, if T = mississippi then s_T = sisi and si_T = sp. The k-th order
empirical entropy of T is defined as:

\[ H_k(T) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_T| \, H_0(w_T). \qquad (1) \]

The k-th order empirical entropy H_k(T) is a lower bound to the output size of
any compressor which encodes each character of T using a uniquely decipherable
code that depends only on the character itself and on the k characters preceding
it. For any k ≥ 0 we have H_k(T) ≤ log |Σ|. Note that for strings with many
regularities we may have H_k(T) = o(1). This is unlike the entropy defined in the
probabilistic setting, which is always a constant. As an example, for T = (ab)^(n/2)
we have H_0(T) = 1 and H_k(T) = O((log n)/n) for any k ≥ 1.
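The definitions of this subsection translate almost literally into code. The following sketch computes the context string w_T, the zero-th order entropy, and H_k via equation (1); the function names are illustrative assumptions, and only contexts that actually occur in T are enumerated, since absent contexts contribute nothing to the sum.

```python
from collections import Counter
from math import log2


def h0(s):
    """Zero-th order empirical entropy of s in bits per symbol (0 for the empty string)."""
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def context_string(text, w):
    """w_T: the symbols that follow each occurrence of w in text, left to right."""
    k = len(w)
    return "".join(text[i + k] for i in range(len(text) - k) if text[i:i + k] == w)


def hk(text, k):
    """k-th order empirical entropy: (1/n) * sum over contexts w of |w_T| * H_0(w_T)."""
    n = len(text)
    contexts = {text[i:i + k] for i in range(n - k)}
    total = 0.0
    for w in contexts:
        w_t = context_string(text, w)
        total += len(w_t) * h0(w_t)
    return total / n


print(context_string("mississippi", "s"))
print(context_string("mississippi", "si"))
print(h0("abababab"), hk("abababab", 1))
```

For k = 0 the only context is the empty string, whose context string is T itself, so hk(T, 0) coincides with h0(T).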

2.2 The Burrows-Wheeler Transform

In [1] Burrows and Wheeler introduced a new compression algorithm based on
a reversible transformation now called the Burrows-Wheeler Transform (BWT
from now on). The BWT consists of three basic steps (see Figure 1): (1) append
at the end of T a special character # smaller than any other text character;
(2) form a conceptual matrix $\mathcal{M}_T$ whose rows are the cyclic shifts of the string
T# sorted in lexicographic order; (3) construct the transformed text $T^{bwt}$ by
taking the last column of matrix $\mathcal{M}_T$. Notice that every column of $\mathcal{M}_T$, hence
also the transformed text $T^{bwt}$, is a permutation of T#. In particular the first
column of $\mathcal{M}_T$, call it F, is obtained by lexicographically sorting the characters
of T# (or, equally, the characters of $T^{bwt}$).



    cyclic shifts of T#           F               T^bwt

    mississippi#                  #  mississipp   i
    ississippi#m                  i  #mississip   p
    ssissippi#mi                  i  ppi#missis   s
    sissippi#mis                  i  ssippi#mis   s
    issippi#miss                  i  ssissippi#   m
    ssippi#missi        ==>       m  ississippi   #
    sippi#missis                  p  i#mississi   p
    ippi#mississ                  p  pi#mississ   i
    ppi#mississi                  s  ippi#missi   s
    pi#mississip                  s  issippi#mi   s
    i#mississipp                  s  sippi#miss   i
    #mississippi                  s  sissippi#m   i

Fig. 1. Example of Burrows-Wheeler transform for the string T = mississippi. The
matrix on the right has the rows sorted in lexicographic order. The output of the BWT
is the last column; in this example the string ipssm#pissii.
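The three steps of the transform can be sketched directly in Python. This is an illustrative, quadratic-space construction that materialises every cyclic shift; it is not how the transform is computed in practice, and the function name `bwt` and the choice of `#` as the end marker (which sorts before letters in ASCII) are assumptions of this sketch.

```python
def bwt(text: str, end: str = "#") -> str:
    """Burrows-Wheeler transform of text, built from the sorted cyclic shifts."""
    if end in text:
        raise ValueError("the end marker must not occur in the text")
    s = text + end                                        # step (1)
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))   # step (2)
    return "".join(row[-1] for row in rows)               # step (3)


def first_column(text: str, end: str = "#") -> str:
    """Column F: the characters of text + end in sorted order."""
    return "".join(sorted(text + end))


if __name__ == "__main__":
    print(bwt("mississippi"))
    print(first_column("mississippi"))
```

Both outputs are permutations of T#, as noted in the text.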

We remark that the BWT by itself is not a compression algorithm since $T^{bwt}$
is just a permutation of T#. However, if T has some regularities the BWT will
“group together” several occurrences of the same character. As a result, the
transformed string $T^{bwt}$ contains long runs of identical characters and turns out
to be highly compressible (see e.g. [1, 17] for details).

Because of the special character #, when we sort the rows of $\mathcal{M}_T$ we are
essentially sorting the suffixes of T. Therefore there is a strong relation between
the matrix $\mathcal{M}_T$ and the suffix array built on T. The matrix $\mathcal{M}_T$ has also other
remarkable properties; to illustrate them we introduce the following notation:

- Let C[·] denote the array of length $|\Sigma|$ such that C[c] contains the total
  number of text characters which are alphabetically smaller than c.

- Let Occ(c, q) denote the number of occurrences of character c in the prefix
  $T^{bwt}[1, q]$.

- Let $LF(i) = C[T^{bwt}[i]] + Occ(T^{bwt}[i], i)$.

LF(·) stands for Last-to-First column mapping since the character $T^{bwt}[i]$,
in the last column of $\mathcal{M}_T$, is located in the first column F at position LF(i).
For example in Figure 1 we have LF(10) = C[s] + Occ(s, 10) = 12; and in fact
$T^{bwt}[10]$ and F[LF(10)] = F[12] both correspond to the first s in the string
mississippi.

The LF(·) mapping allows us to scan the text T backward. Namely, if $T[k] = T^{bwt}[i]$
then $T[k-1] = T^{bwt}[LF(i)]$. For example in Fig. 1 we have that T[3] = s
is the 10th character of $T^{bwt}$ and we correctly have $T[2] = T^{bwt}[LF(10)] = T^{bwt}[12]$
= i (see [3] for details).
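As an illustration of the notation, the sketch below computes C, Occ and LF by plain scanning and uses LF to walk the text backward. It assumes 1-based row indices as in the paper; the helper names are hypothetical, and the linear-time `occ` is only a stand-in for the constant-time structure discussed later.

```python
from collections import Counter


def build_c(bwt: str) -> dict:
    """C[c]: number of characters of the text that are smaller than c."""
    c_table, total = {}, 0
    for ch, count in sorted(Counter(bwt).items()):
        c_table[ch] = total
        total += count
    return c_table


def occ(bwt: str, c: str, q: int) -> int:
    """Occurrences of c in bwt[1, q] (1-based, inclusive); a plain scan."""
    return bwt[:q].count(c)


def lf(bwt: str, c_table: dict, i: int) -> int:
    """Last-to-First mapping for row i (1-based)."""
    ch = bwt[i - 1]
    return c_table[ch] + occ(bwt, ch, i)


def invert_bwt(bwt: str) -> str:
    """Recover T by scanning it backward with LF, starting from row 1."""
    c_table = build_c(bwt)
    row, reversed_text = 1, []   # row 1 starts with the end marker, so it ends with T[n]
    for _ in range(len(bwt) - 1):
        reversed_text.append(bwt[row - 1])
        row = lf(bwt, c_table, row)
    return "".join(reversed(reversed_text))


if __name__ == "__main__":
    b = "ipssm#pissii"
    print(lf(b, build_c(b), 10))
    print(invert_bwt(b))
```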

2.3 The FM-Index 

The FM-index is a self-index that allows one to search efficiently for the occurrences
of an arbitrary pattern P[1,p] as a substring of the text T[1,n]. Pattern P is
provided on-line whereas the text T is given to be preprocessed in advance. The
number of pattern occurrences in T is hereafter indicated with occ. The term
self-index highlights the fact that T is not stored explicitly but can be derived
from the FM-index.

The FM-index consists of a compressed representation of $T^{bwt}$ together with
some auxiliary information which makes it possible to compute in O(1) time
the value Occ(c, q) for any character c and for any q, $0 \le q \le n$. The two
key procedures to operate on the FM-index are: the counting of the number of
pattern occurrences (shortly get_rows), and the location of their positions in the
text T (shortly get_position). Note that the counting process returns the value
occ, whereas the location process returns occ distinct integers in the range [1, n].



Algorithm get_rows(P[1,p])

1. i ← p, c ← P[p], First ← C[c] + 1, Last ← C[c + 1];
2. while ((First ≤ Last) and (i ≥ 2)) do
3.     c ← P[i - 1];
4.     First ← C[c] + Occ(c, First - 1) + 1;
5.     Last ← C[c] + Occ(c, Last);
6.     i ← i - 1;
7. if (Last < First) then return “no rows prefixed by P[1,p]” else return
   (First, Last).



Fig. 2. Algorithm get_rows for finding the set of rows prefixed by P[1,p], and thus for
counting the pattern’s occurrences occ = Last - First + 1. Recall that C[c] is the number
of text characters which are alphabetically smaller than c, and that Occ(c, q) denotes
the number of occurrences of character c in $T^{bwt}[1, q]$.
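A direct transcription of Algorithm get_rows into Python might look as follows. This is a sketch under simplifying assumptions: Occ is computed by scanning rather than in constant time, C[c + 1] is obtained as C[c] plus the number of occurrences of c, and the function returns `None` where the pseudocode reports that no row is prefixed by P.

```python
from collections import Counter


def get_rows(bwt: str, pattern: str):
    """Return (First, Last), the 1-based range of rows prefixed by pattern, or None."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(c: str, q: int) -> int:
        return bwt[:q].count(c)          # stand-in for the O(1) Occ structure

    c = pattern[-1]
    if c not in c_table:
        return None
    first = c_table[c] + 1
    last = c_table[c] + counts[c]        # plays the role of C[c + 1]
    i = len(pattern)
    while first <= last and i >= 2:
        c = pattern[i - 2]               # P[i - 1] in the paper's 1-based notation
        if c not in c_table:
            return None
        first = c_table[c] + occ(c, first - 1) + 1
        last = c_table[c] + occ(c, last)
        i -= 1
    return None if last < first else (first, last)


if __name__ == "__main__":
    print(get_rows("ipssm#pissii", "ssi"))
    print(get_rows("ipssm#pissii", "issi"))
```

The number of occurrences is then `last - first + 1` whenever a range is returned.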



Figure 2 sketches the pseudocode of the counting operation that works in p
phases, numbered from p to 1. The i-th phase preserves the following invariant:
The parameter First points to the first row of the BWT matrix $\mathcal{M}_T$ prefixed by
P[i,p], and the parameter Last points to the last row of $\mathcal{M}_T$ prefixed by P[i,p].
After the final phase, P prefixes the rows between First and Last and thus,
according to the properties of matrix $\mathcal{M}_T$ (see Section 2.2), we have occ =
Last - First + 1. It is easy to see that the running time of get_rows is dominated
by the cost of the 2p computations of the values Occ( ).



Algorithm get_position(i)

1. i' ← i, t ← 0;
2. while row i' is not marked do
3.     i' ← LF[i'];
4.     t ← t + 1;
5. return Pos(i') + t;



Fig. 3. Algorithm get_position for the computation of Pos(i).



Given the range (First, Last), we now consider the problem of retrieving the
positions in T of these pattern occurrences. We notice that every row in $\mathcal{M}_T$ is
prefixed by some suffix of T. For example, in Fig. 1 the fourth row of $\mathcal{M}_T$ is
prefixed by the text suffix T[5, 11] = issippi. Then, for i = First, First + 1, ..., Last
we use procedure get_position(i) to find the position in T of the suffix that
prefixes the i-th row $\mathcal{M}_T[i]$. Such a position is denoted hereafter by Pos(i), and
the pseudocode of get_position is given in Figure 3. The intuition underlying
its functioning is simple. We scan backward the text T using the LF(·) mapping
(see Section 2.2) until a marked position is met. If we mark one text position
every $\Theta(\log^2 n/\log\log n)$, the while loop is executed $O(\log^2 n/\log\log n)$
times. Since the computation of LF(i) can be done via at most $|\Sigma|$ computations
of Occ(), we have that get_position takes $O(|\Sigma| (\log^2 n/\log\log n))$ time.
Finally, we observe that marking one position every $\Theta(\log^2 n/\log\log n)$ takes
$O(n \log\log n/\log n)$ bits overall. Combining the observations on get_position with
the ones for get_rows, we get [3]:

Theorem 1. For any string T[1,n] drawn from a constant-sized alphabet $\Sigma$,
the FM-index counts the occurrences of any pattern P[1,p] within T taking O(p)
time. The location of each pattern occurrence takes $O(|\Sigma| \log^2 n/\log\log n)$ time.
The size of the FM-index is bounded by $5nH_k(T) + o(n)$ bits, for any $k \ge 0$.
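The locating procedure of Fig. 3 can be sketched as follows. The sketch is an assumption-laden illustration rather than the paper's data structure: rows are marked by sampling every `step`-th text position (the paper uses a spacing of $\Theta(\log^2 n/\log\log n)$), the marks are kept in a plain dictionary, and LF is evaluated by scanning. The names `build_index` and `get_position` as Python functions are choices of this sketch.

```python
from collections import Counter


def build_index(text: str, step: int, end: str = "#"):
    """Return (bwt, C, marked); marked maps a 1-based row to its 1-based text position."""
    s = text + end
    # order[r] is the 0-based start of the cyclic shift that appears as row r + 1
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in order)
    c_table, total = {}, 0
    for ch, count in sorted(Counter(s).items()):
        c_table[ch] = total
        total += count
    # mark text positions 1, 1 + step, 1 + 2 * step, ...
    marked = {row + 1: i + 1 for row, i in enumerate(order) if i % step == 0}
    return bwt, c_table, marked


def get_position(i: int, bwt: str, c_table: dict, marked: dict) -> int:
    """Pos(i): text position of the suffix prefixing row i, as in Fig. 3."""
    row, t = i, 0
    while row not in marked:
        ch = bwt[row - 1]
        row = c_table[ch] + bwt[:row].count(ch)   # one LF step, one position backward
        t += 1
    return marked[row] + t


if __name__ == "__main__":
    bwt, c_table, marked = build_index("mississippi", step=4)
    # rows 11 and 12 are the rows prefixed by "ssi" in Fig. 1
    print(sorted(get_position(r, bwt, c_table, marked) for r in (11, 12)))
```

Because position 1 is always marked, the backward scan stops after at most `step - 1` LF steps.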

In order to retrieve the content of T[l, r], we must first find the row in $\mathcal{M}_T$
that corresponds to r, and then issue $\ell = r - l + 1$ backward steps in T, using
the LF(·) mapping. Starting at the lowest marked text position that follows
r, we perform $O(\log^2 n/\log\log n)$ steps until reaching r. Then we perform $\ell$
additional LF-steps to collect the text characters. The resulting complexity is
$O((\ell + \log^2 n/\log\log n) |\Sigma|)$.

We point out the existence [6] of a variant of the FM-index that achieves
O(p + occ) query time and uses $O(nH_k(T) \log^{\epsilon} n) + o(n)$ bits of storage. This
data structure exploits the interplay between the Burrows-Wheeler compression
algorithm and the LZ78 algorithm [22]. Notice that this is the first full-text index
achieving o(n log n) bits of storage, possibly o(n) on highly compressible texts,
and output sensitivity in the query execution.

As we mentioned in the Introduction, the main drawback of the FM-index
is that, hidden in the o(n) term of the space bound, there are constants which
depend exponentially on the alphabet size $|\Sigma|$. In Section 3 we describe a simple
alternative representation of $T^{bwt}$ which takes $nH_k(T) + O(\log|\Sigma| \, (n \log\log n)/\log n)$ bits
and allows the computation of Occ(c, q) and $T^{bwt}[i]$ in $O(\log|\Sigma|)$ time.

2.4 Compression Boosting 

The concept of compression boosting has been recently introduced in [2, 7, 8],
opening the door to a new approach to data compression. The key idea is that one
can take an algorithm whose performance can be bounded in terms of the 0-th
order entropy and obtain, via the booster, a new compressor whose performance
can be bounded in terms of the k-th order entropy, simultaneously for all k.
Putting it another way, one can take a compression algorithm that uses no
context information at all and, via the boosting process, obtain an algorithm
that automatically uses the “best possible” contexts.

To simplify the exposition, we now state a boosting theorem in a form which 
is slightly different from the version described in [2,7]. However, the proof of 
Theorem 2 can be obtained by a straightforward modification of the proof of 
Theorem 4.1 in [7]. 

Theorem 2. Let A be an algorithm which compresses any string s in less than
$|s|H_0(s) + f(|s|)$ bits, where f(·) is a non decreasing concave function. Given
T[1, n] there is an O(n) time procedure that computes a partition $s_1, s_2, \ldots, s_z$ of
$T^{bwt}$ such that, for any $k \ge 0$, we have

$$\sum_{i=1}^{z} |A(s_i)| \le \sum_{i=1}^{z} \left( |s_i| H_0(s_i) + f(|s_i|) \right) \le nH_k(T) + |\Sigma|^k f(n/|\Sigma|^k).$$

Proof. (Sketch). According to Theorem 4.1 in [7], the booster computes the
partition that minimizes the function $\sum_{i=1}^{z} |s_i| H_0(s_i) + f(|s_i|)$. To determine
the right side of the above inequality, we consider the partition $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_m$
induced by the contexts of length k in T. For such partition we have $m \le |\Sigma|^k$
and $\sum_{i=1}^{m} |\hat{s}_i| H_0(\hat{s}_i) = nH_k(T)$. The hypothesis on f implies that
$\sum_{i=1}^{m} f(|\hat{s}_i|) \le |\Sigma|^k f(n/|\Sigma|^k)$ and the theorem follows. ■

To understand the relevance of this result suppose that we want to compress
T[1,n] and that we wish to exploit the zero-th order compressor A. Using the
booster we can first compute the partition $s_1, s_2, \ldots, s_z$ of $T^{bwt}$ and then
compress each $s_i$ using A. By the above theorem, the overall space occupancy would
be bounded by $\sum_i |A(s_i)| \le nH_k(T) + |\Sigma|^k f(n/|\Sigma|^k)$. Note that the process is
reversible, because the decompression of each $s_i$ retrieves $T^{bwt}$, and from $T^{bwt}$
we can retrieve T using the inverse BWT. Summing up, the booster allows us to
compress T up to its k-th order entropy using only the zero-th order compressor
A. Note that the parameter k is known neither to A nor to the booster; it comes
into play only in the space complexity analysis. Additionally, the space bound
in Theorem 2 holds simultaneously for all $k \ge 0$. The only information that is
required by the booster is the function f(n) such that $|s|H_0(s) + f(|s|)$ is an
upper bound on the size of the output produced by A on input s.

2.5 The Wavelet Tree

Given a binary sequence S[1, m] and $b \in \{0, 1\}$, consider the following operations:
$Rank_b(S, i)$ computes the number of b's in S[1, i], and $Select_b(S, i)$ computes the
position of the i-th b in S. In [19] the following has been proven:

Theorem 3. Let S[1,m] be a binary sequence containing t occurrences of the
digit 1. There exists a data structure (called FID) that supports $Rank_b(S, i)$ and
$Select_b(S, i)$ in constant time, and uses $\lceil \log \binom{m}{t} \rceil + O((m \log\log m)/\log m)$ =
$mH_0(S) + O((m \log\log m)/\log m)$ bits of space.

If, instead of a binary sequence, we have a sequence W[1, w] over an arbitrary
alphabet $\Sigma$, a compressed and indexable representation of W is provided by the
wavelet tree [11] which is a clever generalization of the FID data structure.

Theorem 4. Let W[1, w] denote a string over an arbitrary alphabet $\Sigma$. The
wavelet tree built on W uses $wH_0(W) + O(\log|\Sigma| \, (w \log\log w)/\log w)$ bits of
storage and supports in $O(\log|\Sigma|)$ time the following operations:

- given q, $1 \le q \le w$, the retrieval of the character W[q];

- given $c \in \Sigma$ and q, $1 \le q \le w$, the computation of the number of occurrences
  $Occ_W(c, q)$ of c in W[1, q].

To make the paper more self-contained we recall the basic ideas underlying
the wavelet tree. Consider a balanced binary tree $\mathcal{T}$ whose leaves contain the
characters of the alphabet $\Sigma$. $\mathcal{T}$ has depth $O(\log|\Sigma|)$. Each node u of $\mathcal{T}$ is
associated with a string $W_u$ that represents the subsequence of W containing
only the characters that descend from u. The root is thus associated with the
entire W. To save space and be alphabet-friendly, the wavelet tree does not
store $W_u$ but a binary image of it, denoted by $B_u$, that is computed as follows:
$B_u[i] = 0$ if the character $W_u[i]$ descends from the left child of u, otherwise
$B_u[i] = 1$. Assume now that every binary sequence $B_u$ is implemented with the
data structure of Theorem 3; then it is an exercise to derive the given space
bounds and to implement $Occ_W(c, q)$ and retrieve W[q] in $O(\log|\Sigma|)$ time.
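The exercise mentioned above can be sketched in Python. This is an illustrative, uncompressed version: each node stores its bit vector together with a table of prefix counts, which stands in for the FID structure of Theorem 3, so the space bound of Theorem 4 is not achieved here. The class name and method names are assumptions of this sketch.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a sequence W, with 1-based queries."""

    def __init__(self, w, alphabet=None):
        self.alphabet = sorted(set(w)) if alphabet is None else alphabet
        self.bits = None
        if len(self.alphabet) <= 1:
            return                                     # leaf: a single character
        mid = len(self.alphabet) // 2
        self.left_set = set(self.alphabet[:mid])
        # B_u[i] = 0 if W_u[i] descends from the left child, 1 otherwise
        self.bits = [0 if ch in self.left_set else 1 for ch in w]
        # ones[q] = number of 1s in bits[1, q]; a stand-in for Rank_1
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([ch for ch in w if ch in self.left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in w if ch not in self.left_set],
                                 self.alphabet[mid:])

    def occ(self, c, q):
        """Number of occurrences of c in W[1, q]."""
        if self.bits is None:
            return q if self.alphabet and self.alphabet[0] == c else 0
        ones = self.ones[q]
        if c in self.left_set:
            return self.left.occ(c, q - ones)          # Rank_0 = q - Rank_1
        return self.right.occ(c, ones)

    def access(self, q):
        """The character W[q]."""
        if self.bits is None:
            return self.alphabet[0]
        ones = self.ones[q]
        if self.bits[q - 1] == 0:
            return self.left.access(q - ones)
        return self.right.access(ones)


if __name__ == "__main__":
    wt = WaveletTree("ipssm#pissii")
    print(wt.occ("s", 10), wt.access(5))
```

Each query follows a single root-to-leaf path, so it performs one rank computation per level of the tree.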



3 Alphabet-Friendly FM-Index 

We now have all the tools we need in order to build a version of the FM-index
that scales well with the alphabet size. The crucial observation is the following.
To build the FM-index we need to solve two problems: a) to compress $T^{bwt}$ up
to $H_k(T)$, and b) to compute Occ(c, q) in time independent of n. We use the
boosting technique to transform problem a) into the problem of compressing the
strings $s_1, s_2, \ldots, s_z$ up to their zero-th order entropy, and we use the wavelet
tree to create a compressed (up to $H_0$) and indexable representation of each $s_i$,
thus solving simultaneously problems a) and b). The details of the construction
are given in Figure 4.

1. Use Theorem 2 to determine the optimal partition $s_1, s_2, \ldots, s_z$ of $T^{bwt}$ with
   respect to $f(t) = (K t \log|\Sigma| \log\log t)/\log t + (1 + |\Sigma|) \log n$, where K is such that
   $(K t \log|\Sigma| \log\log t)/\log t$ is larger than the $O((t \log|\Sigma| \log\log t)/\log t)$ term in
   Theorem 4.

2. Build a binary string B that keeps track of the starting positions in $T^{bwt}$ of the
   $s_i$'s. The entries of B are all zeroes except for the bits at positions
   $1 + \sum_{j=1}^{i-1} |s_j|$, i = 1, ..., z, which are set to 1. Construct the data
   structure of Theorem 3 over the binary string B.

3. For each string $s_i$, i = 1, ..., z build:

   (a) the array $C_i[1, |\Sigma|]$ such that $C_i[c]$ stores the occurrences of character c within
       $s_1 s_2 \cdots s_{i-1}$;

   (b) the wavelet tree $\mathcal{T}_i$ built on $s_i$.



Fig. 4. Construction of an alphabet-friendly FM-index.



To compute $T^{bwt}[q]$, we first determine the substring $s_y$ containing the q-th
character of $T^{bwt}$ by computing $y = Rank_1(B, q)$. Then we exploit the wavelet
tree $\mathcal{T}_y$ to determine $T^{bwt}[q]$. By Theorem 3 the former step takes O(1) time,
and by Theorem 4 the latter step takes $O(\log|\Sigma|)$ time.

To compute Occ(c, q), we initially determine the substring $s_y$ where the row q
occurs, $y = Rank_1(B, q)$. Then we exploit the wavelet tree $\mathcal{T}_y$ and the array $C_y[c]$
to compute $Occ(c, q) = Occ_{s_y}(c, q') + C_y[c]$, where $q' = q - \sum_{j=1}^{y-1} |s_j|$. Again, by
Theorems 3 and 4 this computation takes overall $O(\log|\Sigma|)$ time.

Combining these bounds with the results stated in Section 2.3, we obtain that
the alphabet-friendly FM-index takes $O(p \log|\Sigma|)$ time to count the occurrences
of a pattern P[1,p] and $O(\log|\Sigma| (\log^2 n/\log\log n))$ time to retrieve the position
of each occurrence.
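The way Occ(c, q) is assembled from a per-piece count plus a prefix array can be illustrated with a small Python sketch. Several simplifications are assumed and are not the paper's construction: the partition is supplied by the caller instead of being computed by the booster, a sorted list of starting positions with binary search replaces the bit vector B and its Rank structure, and a plain scan of the piece replaces the wavelet tree. All function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def build_partitioned(bwt: str, starts: list):
    """Split bwt at the given 1-based starting positions (starts[0] must be 1).

    Returns (pieces, prefix_counts), where prefix_counts[i] plays the role of the
    array C_i: the character counts within the pieces preceding piece i.
    """
    bounds = list(starts) + [len(bwt) + 1]
    pieces, prefix_counts, running = [], [], Counter()
    for a, b in zip(bounds, bounds[1:]):
        prefix_counts.append(dict(running))
        piece = bwt[a - 1:b - 1]
        pieces.append(piece)
        running.update(piece)
    return pieces, prefix_counts


def occ(pieces, prefix_counts, starts, c: str, q: int) -> int:
    """Occurrences of c in bwt[1, q], combining a local count with C_y[c]."""
    if q == 0:
        return 0
    y = bisect_right(starts, q)            # number of pieces starting at or before q
    q_local = q - (starts[y - 1] - 1)      # q' = q minus the total length of earlier pieces
    local = pieces[y - 1][:q_local].count(c)   # stand-in for the wavelet tree query
    return local + prefix_counts[y - 1].get(c, 0)


if __name__ == "__main__":
    bwt = "ipssm#pissii"
    starts = [1, 5, 8]                     # an arbitrary partition, for illustration only
    pieces, prefix_counts = build_partitioned(bwt, starts)
    print(pieces, occ(pieces, prefix_counts, starts, "s", 10))
```

Any partition gives correct answers; the booster's choice of partition only affects the space bound.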

Concerning the space occupancy we observe that by Theorem 3, the storage
of B takes $\lceil \log \binom{n}{z} \rceil + O((n \log\log n)/\log n)$ bits. Each array $C_i$ takes $O(|\Sigma| \log n)$
bits, and each wavelet tree $\mathcal{T}_i$ occupies $|s_i| H_0(s_i) + O(\log|\Sigma| \, (|s_i| \log\log|s_i|)/\log|s_i|)$ bits
(Theorem 4). Since $\log \binom{n}{z} \le z \log n$, the total occupancy is bounded by

$$\sum_{i=1}^{z} \left( |s_i| H_0(s_i) + K |s_i| \frac{\log|\Sigma| \log\log|s_i|}{\log|s_i|} + (1 + |\Sigma|) \log n \right) + O((n \log\log n)/\log n).$$

Function f(t) defined at Step 1 of Figure 4 was built to match exactly the
overhead space bound we get for each partition, so the partitioning was optimally
built for that overhead. Hence we can apply Theorem 2 to get that the above
summation is bounded by

$$nH_k(T) + O\!\left( \log|\Sigma| \, \frac{n \log\log n}{\log(n/|\Sigma|^k)} \right) + O(|\Sigma|^{k+1} \log n). \qquad (2)$$

We are interested in bounding the space occupancy in terms of $H_k$ only for
$k \le \alpha \log_{|\Sigma|} n$ for some $\alpha < 1$. In this case we have $|\Sigma|^k \le n^{\alpha}$ and (2) becomes

$$nH_k(T) + O(\log|\Sigma| \, (n \log\log n)/\log n). \qquad (3)$$

We achieve the following result:

Theorem 5. The data structure described in Figure 4 indexes a string T[1,n]
over an arbitrary alphabet $\Sigma$, using a storage bounded by

$$nH_k(T) + O(\log|\Sigma| \, (n \log\log n)/\log n)$$

bits for any $k \le \alpha \log_{|\Sigma|} n$ and $0 < \alpha < 1$. We can count the number of
occurrences of a pattern P[1,p] in T in $O(p \log|\Sigma|)$ time, locate each occurrence
in $O(\log|\Sigma| (\log^2 n/\log\log n))$ time, and display a text substring of length $\ell$ in
$O((\ell + \log^2 n/\log\log n) \log|\Sigma|)$ time.

It is natural to ask whether a more sophisticated data structure can achieve
a $nH_k(T) + o(n)$ space bound without any restriction on the alphabet size or
context length. The answer to this question is negative. To see this, consider the
extreme case in which $|\Sigma| = n$, that is, the input string consists of a permutation
of n distinct characters. In this case we have $H_k(T) = 0$ for $k \ge 1$. Since
the representation of such string requires $\Theta(n \log n)$ bits, a self-index of size
$nH_k(T) + o(n)$ bits cannot exist.

Finally, we note that the wavelet tree alone, over the full BWT transformed
text $T^{bwt}$, would be enough to obtain the time bounds we achieved. However,
the resulting structure size would depend on $H_0(T)$ rather than $H_k(T)$. The
partitioning of the text into areas is crucial to obtain the latter space bounds.
A previous technique combining wavelet trees with text partitioning [15] takes
each run of equal letters in $T^{bwt}$ as an area. It requires $2n(H_k \log|\Sigma| + 1 + o(1))$
bits of space and counts pattern occurrences in the optimal O(p) time. It would
be interesting to retain the optimal space complexity obtained in this work and
the optimal search time O(p).



Abstract. In a bulk update of a search tree a set of individual updates
(insertions or deletions) is brought into the tree as a single transaction.
In this paper, we present a bulk-insertion algorithm for the class of (a, b)-trees
(including B+-trees). The keys of the bulk to be inserted are divided
into subsets, each of which contains keys with the same insertion place.
From each of these sets, together with the keys already in the insertion
place, an (a, b)-tree is constructed and substituted for the insertion place.
The algorithm performs the rebalancing task in a novel manner minimizing
the number of disk seeks required. The algorithm is designed to work
in a concurrent environment where concurrent single-key actions can be
present.



1 Introduction 

Bulk insertion is an important index operation, for example in document data- 
bases and data warehousing. Document databases usually apply indices contain- 
ing words and their occurrence information. When a new document is inserted 
into the database, a bulk insertion containing words in this document will be 
performed. Experiments of a commercial system designed for a newspaper house 
in Finland [15] have shown that a bulk insertion can be up to two orders of 
magnitude faster than the same insertions individually performed. 

Additions to large data warehouses may number in the hundreds of thousands
or even in the millions per day, and thus indices that require a disk operation
per insertion are not acceptable. As a solution, a new B-tree like structure for
indexing huge warehouses with frequent insertions is presented in [8]. This
structure is similar to the buffer tree structure of [1, 2]; the essential feature is that
one advancing step in the tree structure always means a search phase step for a
set of several insertions.

In this paper, we consider the case in which the bulk, i.e., the set of keys to
be inserted, fits into the main memory. This assumption is reasonable in most
applications. Only some extreme cases of frequent insertions into warehouses
do not fulfil this requirement. We present a new bulk-insertion algorithm, in
which the possibility of concurrent single-key operations is taken into account.
The bulk insertion is performed by local operations that involve only a constant
number of nodes at a time. This makes it possible to design efficient concurrency
control algorithms, because only a constant number of nodes need be latched at
a time. Allowing concurrent searches is vital in document databases [15] and in
www-search-engine applications.

The search trees to be considered are a class of multi-way trees, called (a, b)-trees
[6, 12]. The class of (a, b)-trees is a generalization of B+-trees: in an (a, b)-tree,
$b \ge 2a - 1$, a and b denote the minimum and the maximum number of
elements in a node. The trees considered are external, i.e., keys are stored in the
leaves and internal nodes contain routing information.

For our model of I/O-complexity we assume that each node of the tree is
stored in one disk page. We assume that the current path from the root to a leaf
(or the path not yet reached a leaf but advancing towards a leaf) is always found
in the main memory, but otherwise accessing a node requires one I/O-operation.
Moreover, in our model we count writing (or reading) of several consecutive disk
pages as one I/O-operation. This is justified whenever the number of consecutive
pages is “reasonable” because the seek time has become a larger and larger factor
in data transfer to/from disk [17]. In our paper this property of the model comes
into use when a portion of the bulk goes into the same leaf, and this (usually a
relatively small) part of the bulk will be written on disk.

2 General Bulk Insertion 

In a level-linked (a, b)-tree [6, 12], $a \ge 2$, $b \ge 2a - 1$, all paths from the root to a
leaf have the same length. The leaves contain at least a and at most b keys, and,
similarly, the internal nodes have at least a and at most b children. The root of
the tree is an exception and has at least 2 and at most b children. In leaves each
key is coupled with a data record (or with a pointer to data). An internal node
v with n children is of the form

$(p_0)(r_1, p_1)(r_2, p_2) \cdots (r_n, p_n)(r_{n+1}, p_{n+1})$,

where for i = 1, ..., n, $p_i$ is the pointer to the i-th child of v. This i-th child is
the root of the subtree that contains the keys in the interval $(r_i, r_{i+1}]$. Values $r_i$,
$1 \le i \le n + 1$, in an internal node are called routers. We say that node v covers
the interval $(r_1, r_{n+1}]$.

Router $r_1$ is smaller than any key in the subtree rooted at v, called the
lowvalue of node v, denoted lowvalue(v), and router $r_{n+1}$ is the largest possible
key value in this subtree, called the highvalue and denoted highvalue(v). Pointer
$p_0$ points to the node that precedes, and $p_{n+1}$ points to the node that follows node
v at the same level. If node v is the parent of a leaf l and pointer $p_i$, i = 1, ..., n,
in v points to l, then the lowvalue of leaf l is $r_i$ and the highvalue is $r_{i+1}$.
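The node layout and the routing rule can be sketched as a small Python data structure. This is only an illustration of the definitions above: the class name, field names and the use of a sorted Python list for the routers are assumptions, and nothing here models disk pages or latching.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class InternalNode:
    """An internal node (p_0)(r_1, p_1) ... (r_n, p_n)(r_{n+1}, p_{n+1})."""
    routers: List[Any]                       # r_1 < r_2 < ... < r_{n+1}
    children: List[Any]                      # p_1, ..., p_n
    left: Optional["InternalNode"] = None    # p_0, level link to the preceding node
    right: Optional["InternalNode"] = None   # p_{n+1}, level link to the following node

    @property
    def lowvalue(self):
        return self.routers[0]

    @property
    def highvalue(self):
        return self.routers[-1]

    def covers(self, key) -> bool:
        """True if key lies in the half-open interval (r_1, r_{n+1}]."""
        return self.lowvalue < key <= self.highvalue

    def child_index(self, key) -> int:
        """The 1-based index i such that r_i < key <= r_{i+1}."""
        if not self.covers(key):
            raise KeyError(key)
        return bisect_left(self.routers, key)


if __name__ == "__main__":
    v = InternalNode(routers=[0, 10, 20, 30], children=["c1", "c2", "c3"])
    print(v.child_index(10), v.child_index(11), v.child_index(30))
```

With 0-based Python lists, `children[i - 1]` is the child $p_i$ selected by `child_index`.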

The basic idea of our I/O-optimal bulk-insertion algorithm is that the keys
of the bulk sorted in the main memory will efficiently be divided into subsets,
each of which contains keys that have the same insertion place (which is a node
in the leaf level). From each of these sets, called simple bulks, together with
the keys already in the insertion place, an (a, b)-tree, called an insertion tree,
is constructed and substituted for the insertion place. After this process, called
bulk insertion without rebalancing, has been completed, the structure contains
all keys of the bulk and can already be used as a search tree with logarithmic
search time. In order to retain the (a, b)-tree properties the structure needs, of
course, rebalancing.

Moreover, we aim at a solution where concurrency is allowed and concurrency
control is efficient in the sense that each process will latch only a constant number
of nodes at a time and that a latch on a node is held only for a constant time.
The concurrency control needed before rebalancing is simple latch coupling in the
same way as for single-key updates (insertions or deletions). The efficient latch
coupling in the search phase of an update (bulk or single) applies might-write
latches, which exclude other updates but allow readers to apply their shared
latches. Shared or read latches applied by readers exclude only the exclusive
latches required on nodes to be written.

Given an (a, b)-tree T and a bulk with m keys, denoted $k_1, \ldots, k_m$, in
ascending order, the bulk insertion into T without rebalancing works as follows.

Algorithm BI (Bulk Insertion)

Step 1. Set i = 1, and set p = the root of T.

Step 2. Starting at node p search for the insertion place $l_i$ of key $k_i$. Push
each node in the path from node p to $l_i$ onto stack S. In the search process apply
latch coupling in the might-write mode. When leaf $l_i$ is found, the latch on it
will be upgraded into an exclusive latch. The latch on the parent of $l_i$ will be
released.

Step 3. Let $k_{i+j}$ be the largest key in the input bulk that is less than or equal
to the highvalue of $l_i$. From the keys $k_i, \ldots, k_{i+j}$ together with the keys already
in $l_i$, an (a, b)-tree $B_i$ is constructed and substituted for $l_i$ in T. This will be
done by storing the contents of the root of $B_i$ into node $l_i$, so that no change
is needed in the parent of $l_i$. Release the latch on the node that is now the root
of $B_i$.

Step 4. Set i = i + j + 1. If i > m, then continue to Step 5. Otherwise pop
nodes from stack S until the popped node p covers the key $k_i$. If such a node is
not found in the stack (this may occur if the root has been split after the bulk
insertion started), set p as the new root. The nodes which are popped from the
stack are latched in the shared mode, but latch-coupling is not used. Return to
Step 2.

Step 5. Now all insertion positions have been replaced by the corresponding
insertion trees. Rebalance the constructed tree by performing Algorithm SBR
(given below) for each $B_i$ in turn.
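The way Steps 2 to 4 divide the sorted bulk into simple bulks can be illustrated in isolation. The Python sketch below models only the key grouping: the tree is reduced to the sorted list of leaf highvalues (with the last leaf assumed to cover every larger key), and the stack-based search, latching and construction of the insertion trees are left out. The function name `simple_bulks` and this flattened view of the leaf level are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def simple_bulks(bulk, leaf_highvalues):
    """Group sorted bulk keys by insertion place.

    bulk            -- keys k_1 <= ... <= k_m in ascending order
    leaf_highvalues -- highvalues of the leaves from left to right, ascending;
                       the last one must be at least as large as every key
    Returns a list of (leaf_index, keys) pairs, one per simple bulk.
    """
    groups = []
    i = 0
    while i < len(bulk):
        # Step 2: the insertion place of k_i is the first leaf whose highvalue >= k_i
        leaf = bisect_left(leaf_highvalues, bulk[i])
        # Step 3: k_{i+j} is the largest bulk key that is <= highvalue of that leaf
        end = bisect_right(bulk, leaf_highvalues[leaf], lo=i)
        groups.append((leaf, bulk[i:end]))
        # Step 4: continue with key k_{i+j+1}
        i = end
    return groups


if __name__ == "__main__":
    highs = [10, 20, 30, float("inf")]
    print(simple_bulks([3, 5, 12, 18, 40], highs))
```

Each returned group corresponds to one insertion tree $B_i$ in the algorithm above.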

If concurrent single-key updates are allowed, it may happen that in Step 4 
the algorithm must return even to the root although in the original tree only a 
few steps upward would have been enough in order to find the node from which 
to continue. 

The algorithm composed of the first 4 steps of the above algorithm is called
Algorithm BIWR (Bulk Insertion Without Rebalancing). In the following
discussion of the complexity of Algorithm BIWR we assume that the concurrency
is limited to concurrent searches.

If searching must be done in the standard way, that is, only pointers from
parents to children are followed, it is clear that Algorithm BIWR is optimal as
to nodes visited in T and thus in the number of nodes accessed. This is because
each node in the paths from the root to the insertion positions must be accessed
at least once, and this is exactly what the above algorithm does. If parent links
together with level links are applied, a better performance can be obtained in
some special cases, but from results in [3] it is straightforward to derive that no
asymptotic improvement can be obtained.

The insertion tree $B_i$ is constructed in the main memory, but I/O-operations
are needed in writing it on disk as a part of the whole tree. For doing this only
one disk seek is needed and in our model thus only one operation. We have:



Theorem 1. Let T be an (a, b)-tree, and assume that a bulk of m keys is inserted
into T by Algorithm BIWR (Bulk Insertion Without Rebalancing, the first four
steps of Algorithm BI). Then the resulting tree (which has logarithmic depth but
does not fulfil the (a, b)-balance conditions) contains exactly the keys that were
originally in T or were members of the bulk. The I/O-complexity of Algorithm
BIWR is

O(k + L) = O(L),

where k denotes the number of insertion trees $B_i$ constructed in Step 3 of the
algorithm and L denotes the number of different nodes that appear in the paths
from the root of the original tree T to the insertion places $l_i$.



3 Rebalancing 

Our next task is to perform rebalancing. Our solution for rebalancing is designed 
such that concurrent searches and single-key updates are possible. 

We assume first that we are given a situation in which the whole bulk to
be inserted has the same insertion place $l_1$, and Algorithm BIWR has produced
a new tree T, in which $l_1$ has been replaced by an insertion tree $B_1$ (Step 3
in Algorithm BIWR). Before we can start the rebalancing task we must have
obtained a shared lock on the key interval [k, k'], where k and k', respectively,
are the smallest and the largest key in $B_1$. This lock is requested in Step 3
in Algorithm BIWR before the replacement of $l_1$ by $B_1$ can take place. This
guarantees that no updates that would affect $B_1$ could occur during rebalancing,
provided that performing updates requires obtaining an exclusive lock on the key
to be inserted or deleted, see e.g. [9]. (The locks are not the same as the latches;
latches are for physical entities of a database, and locks for logical entities.
Latches are short duration semaphores, and locks are usually held until the
commit of the transaction involved.)

Now if $B_1$ contains one leaf only, we are done, and the lock on [k, k'] can
be released. Otherwise, we perform the simple bulk rebalancing in the following
way.

Algorithm SBR (Simple Bulk Rebalancing) 

Step 1. Latch exclusively the parent of the root of $B_1$ and denote the latched
node by p. Set h = 1.




Concurrency Control and I/O-Optimality in Bulk Insertion 165 



Step 2. Split node p such that the left part contains all pointers to children
that store keys smaller than the smallest key in $B_1$, and the right part all pointers
to children that store keys larger than the largest key in $B_1$. Denote the nodes
thus obtained by $p_l$ and $p_r$. Observe that both $p_l$ and $p_r$ exist; in the extreme
case node $p_l$ contains only the lowvalue and the level link to the left and $p_r$ only
the highvalue and the level link to the right. In all cases p is set to $p_l$, that is,
$p_l$ is the node that remains latched, and $p = p_l$ does not point to the root of $B_1$
(or its ancestor) any more. Moreover, notice that neither $p_l$ nor $p_r$ can contain
more than b elements, even though, when returning from Step 3, p could contain
b + 1 elements. See Fig. 1 for illustration.





Fig. 1. Splitting the parent of the root of the insertion tree, a = 2 and b = 4. (a)
Original tree. The root of the insertion tree is shaded. (b) Split tree with updated level
links at height 1.



Exclusively latch $p_r$, and the leftmost and rightmost nodes, denoted $q_l$ and
$q_r$, respectively, at the height h in $B_1$. Then compress (by applying fusing or
sharing) node $p_l$ together with node $q_l$, node $p_r$ together with $q_r$, and also adjust
the level links appropriately. (The nodes in T and nodes in $B_1$ at height h are all
linked together by level links and no violations against the (a, b)-tree property
occur in these nodes.) Release all latches held.

Step 3. Set h = h + 1. At height h in T latch exclusively the node, denoted 
p, that has lowvalue smaller than the smallest key in Bi and highvalue larger 
than the largest key in Bi. If in Step 2 node Pr, one level below, was not fused 
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with qr but remained (perhaps shortened because of sharing), add this node as 
a new child to p. For the moment, allow p grow one too large, if necessary. If h 
is smaller than the height of Bi, then return to Step 2. 

Step 4. Now the insertion tree $B_1$ has been properly level-linked with the
rest of the tree, with the exception of the root and the leaf level. As in Step 2,
split node $p$ appropriately into $p_l$ and $p_r$ such that the root of $B_1$ can be put
between them. The nodes at the leaf level which thus far have been level-linked
to the root of $B_1$ must now be level-linked with the leftmost and, respectively,
the rightmost leaf node in $B_1$. For the operation, all changed nodes must be
exclusively latched. At the end, all latches held are released.

Step 5. The whole insertion tree $B_1$ has now been correctly level-linked, but
it might be that the root of $B_1$ and its right brother have no parent, that is,
they can be reached only by level links and not by child links from their parents.
(See Fig. 2.) Thus rebalancing is still needed above the root of $B_1$, and the
need for splits may propagate up to the root of the whole tree. In a concurrent
environment this remaining rebalancing can be done exactly as for single inserts
in $B^{link}$-trees [16]. Figure 3 shows the final rebalanced tree.




Fig. 2. Tree of Fig. 1 after the insertion tree has been correctly level-linked. 




Fig. 3. Tree of Fig. 2 when bulk rebalancing has been finished. 



It is important to note that, for rebalancing, we cannot simply cut the tree
$T$ starting from the insertion place up to the height of $B_1$, and then lift $B_1$ to its
right position. Such an algorithm would need too much simultaneous latching
in order to set the level links correctly. The level links are essential because
they guarantee the correctness of concurrent searching at all times. The simpler
solution to “merge” Bi with T by cutting T at ^i, joining the left part with Bi, 
and joining the result with the right part [13] is not applicable in a concurrent 
environment, either. 

First, for the correctness and complexity (Theorems 2-6), we consider Algorithm
SBR in an environment where only concurrent searches are allowed.
Notice that then node $p$ as specified in Step 2 is directly obtained from the stack
of nodes constructed in the search phase of Algorithm BI.

The following theorem is immediate. Notice that in Step 3 of Algorithm SBR
compressing of nodes as described is always possible because the insertion tree
$B_1$ is in balance. Compressing two nodes means, in the same way as in standard
B-tree rebalancing, that the two nodes are either merged into one node (fusing) or their
contents are redistributed (sharing) such that both nodes meet the (a, b)-tree
conditions.
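
The choice between fusing and sharing can be illustrated with a minimal sketch. It assumes that a node is simply a sorted Python list of elements and that $b \ge 2a - 1$; the function name `compress` and this node representation are illustrative only and are not part of Algorithm SBR. Latching, level links and the router keys in the parent are omitted.

```python
def compress(left, right, a, b):
    """Compress two adjacent nodes, each given as a sorted list of elements.

    Returns a list holding one node (fusing) or two nodes (sharing).
    Assumes b >= 2 * a - 1 and len(left) + len(right) <= 2 * b.
    """
    merged = left + right
    if len(merged) <= b:
        # Fusing: all elements fit into a single node.
        return [merged]
    # Sharing: redistribute the elements evenly. Since len(merged) >= b + 1 >= 2 * a,
    # both halves hold at least a elements, and neither holds more than b.
    mid = len(merged) // 2
    return [merged[:mid], merged[mid:]]


if __name__ == "__main__":
    print(compress([1], [2, 3], a=2, b=4))           # fusing
    print(compress([1, 2], [3, 4, 5, 6], a=2, b=4))  # sharing
```

With a = 2 and b = 4, as in Fig. 1, a one-element node and a two-element node are fused, whereas a two-element node and a four-element node share their contents.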

Theorem 2. Let $T$ be a tree yielded by Algorithm BIWR such that from the
inserted bulk of size $m$ only one insertion tree was constructed. Algorithm SBR
(Simple Bulk Rebalancing) rebalances $T$, that is, yields an (a, b)-tree that contains
exactly the keys originally in $T$. The worst case I/O-complexity of Algorithm
SBR (when only concurrent searches are present) is $O(\log m)$ (Steps 1-3), plus
$O(\log n)$ (Step 4), where $n$ denotes the size of $T$.

Notice that the worst case complexity $O(\log n)$ of Step 5 comes from the fact
that nodes above the root of $B_1$ may be full; this worst case may occur also for
single insertions. Thus, and because it may be necessary to split $h$ nodes, where
$h$ denotes the height of $B_1$, the above algorithm is asymptotically optimal.

Step 5 in Algorithm SBR can be considered as an elimination of a $(b+1)$-
or $(b+2)$-node (a node that contains $b+1$ or $b+2$ elements) from the tree. This is
because the parent of the root of $B_1$ may have received one or two new children. But, as
shown in [5], elimination of a $(b+1)$-node takes amortized constant time, provided
that $b \ge 2a$. (By amortized time we mean the time of an operation averaged
over a worst-case sequence of operations starting with an empty structure. See
[12, 18].) The same holds, of course, for the elimination of a $(b+2)$-node. Thus
Theorem 2 implies:

Theorem 3. Let $T$ be a tree yielded by Algorithm BIWR such that from the
inserted bulk of size $m$ only one insertion tree was constructed. Algorithm SBR
(Simple Bulk Rebalancing) rebalances $T$, that is, yields an (a, b)-tree that contains
exactly the keys originally in $T$. The amortized I/O-complexity of Algorithm SBR
(when only concurrent searches are present) is $O(\log m)$.

The result of Theorem 3 requires that each $B_i$ in Algorithm BI is constructed
so that at most two nodes at each level of $B_i$ contain exactly $a$ or $b$ keys or have
exactly $a$ or $b$ children. This is possible since $a \ge 2$ and $b \ge 2a$, see [7].

Assume that a bulk insertion of $m$ keys without rebalancing has been applied
to an (a, b)-tree yielding a tree denoted by $T$. Assume that the bulk was divided
into $k$ insertion trees, denoted $B_1, B_2, \ldots, B_k$. Rebalancing $T$, that is, the final
step of Algorithm BI, can now be performed by applying Algorithm SBR for
$B_1, B_2, \ldots$, and $B_k$, in turn. The cost of rebalancing includes (i) the total cost of
Steps 1-3 of Algorithm SBR for $B_1, \ldots, B_k$ and (ii) the total cost of rebalancing
(Step 4) above node $p_i$ that has become the parent of $B_i$, $i = 1, \ldots, k$. Part (i)
has I/O-complexity $O(\sum_{i=1}^{k} \log m_i)$, where $m_i$ denotes the size of $B_i$, and part
(ii) has the obvious lower bound $\Omega(L)$, where $L$ denotes the number of different
nodes appearing in the paths from the root to the insertion places in the original
tree. It is easy to see that $O(L + \sum_{i=1}^{k} \log m_i)$ bounds from above part (ii). Of
those nodes that are full before the rebalancing starts only $L$ can be split because
of rebalancing. Rebalancing of one $B_i$ cannot produce more than $O(\log m_i)$ new
full nodes that may need to be split by rebalancing a subsequent $B_j$. Thus the total
number of splits and also the number of I/Os needed for the whole rebalancing
task is $O(L + \sum_{i=1}^{k} \log m_i)$.

We have: 

Theorem 4. Assume that a bulk insertion without rebalancing has been applied
to an (a, b)-tree, and assume that the bulk was divided into $k$ insertion trees with
sizes $m_1, m_2, \ldots, m_k$. Then the worst case I/O-complexity of rebalancing (the
final step of Algorithm BI), provided that only concurrent searches are present,
is

$$\Theta\left(\sum_{i=1}^{k} \log m_i + L\right),$$

where $L$ is the number of different nodes in the paths from the root to the insertion
places (roots of the insertion trees) before the rebalancing starts.
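
To make the two terms of the bound concrete, the following sketch evaluates $\sum_{i=1}^{k} \log m_i + L$ for a small example. It assumes that each search path is given as a list of node identifiers from the root to an insertion place and that logarithms are taken to base 2 (the base only affects the constant hidden by the $\Theta$ notation). The function name and the input format are hypothetical.

```python
import math


def rebalancing_cost_terms(paths, sizes):
    """Return (L, sum of log2(m_i) + L) for given search paths and insertion-tree sizes.

    paths: one list of node identifiers (root first) per insertion place.
    sizes: the sizes m_1, ..., m_k of the insertion trees.
    """
    distinct_nodes = set()
    for path in paths:
        distinct_nodes.update(path)  # shared nodes such as the root count once
    L = len(distinct_nodes)
    log_term = sum(math.log2(m) for m in sizes)
    return L, log_term + L


if __name__ == "__main__":
    example_paths = [["r", "u", "x"], ["r", "u", "y"], ["r", "v", "z"]]
    example_sizes = [8, 4, 16]
    print(rebalancing_cost_terms(example_paths, example_sizes))
```

In this example the three paths share the root and one internal node, so $L = 6$ rather than 9, which is why the bound counts different nodes instead of path lengths.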

For the amortized complexity we have: 

Theorem 5. Assume that a bulk insertion without rebalancing has been applied
to an (a, b)-tree, $b \ge 2a$, and assume that the bulk was divided into $k$ insertion
trees with sizes $m_1, m_2, \ldots, m_k$. Then the amortized I/O-complexity of rebalancing,
provided that only concurrent searches are present, is

$$\Theta\left(\sum_{i=1}^{k} \log m_i\right).$$



Theorems 1 and 4 imply: 

Theorem 6. The worst case I/O-complexity of a bulk insertion into an (a, b)-tree
is

$$\Theta\left(\sum_{i=1}^{k} \log m_i + L\right),$$

where $m_i$ is the size of the $i$th simple bulk and $L$ is the number of different nodes
in the paths from the root to the insertion places.

Our concurrent algorithm is meant to be used together with key searches that
do not change the structure and with single-key operations. The pure searches
are the most important operations that must be allowed together with bulk
insertion. This is certainly important for WWW search engines, and it was vital
for the commercial text database system reported in [15]. For the correctness,
the issues to be taken care of are that no search paths (for pure searches or the
search phases of insertions or deletions) can get lost, and that the possible
splits or compress operations performed by concurrent single-key actions do not
cause any incorrectness.

Because a shared lock on the key interval of the insertion tree must have been
obtained before rebalancing, no changes in the interval trees caused by concurrent
processes can occur during the bulk insertion. Thus the only possibility for
incorrectness (due to bulk insertion) is that a search path gets lost when a node
above the insertion tree is split and the search would go through this node and
end in a leaf of the insertion tree $B_1$, or in a leaf to the right of $B_1$, cf. Fig. 1 (b). But
when this kind of a split occurs, the exclusive latches as defined in Step 3 of the
algorithm prevent the search path losses. After Step 3 has been completed, all
leaves of $B_1$, and all leaves to the right of $B_1$ that have lost their parent path to
the root (in Fig. 1 (b) one leaf to the right of the insertion tree) are again reachable
because of the level links set. The possible splits or compress operations of
concurrent single-key actions imply the possibility that the nodes of the search
path to the insertion place pushed on the stack are not always parents of the split
nodes. Thus the parent must be searched, see Step 3, by a left-to-right traverse
starting from the node that is popped from the stack. (Cf. [16].)
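
The left-to-right traverse along level links can be sketched as follows. The sketch assumes a hypothetical `Node` class that stores only a lowvalue, a highvalue and a level link to the right neighbor; latching is omitted, so this shows the search logic only and not the concurrency control of the algorithm.

```python
class Node:
    """Hypothetical level-linked node: key range plus a link to the right neighbor."""

    def __init__(self, lowvalue, highvalue):
        self.lowvalue = lowvalue
        self.highvalue = highvalue
        self.right = None  # level link to the right neighbor at the same height


def find_covering_node(start, smallest_key, largest_key):
    """Follow level links to the right until a node covers the whole key range.

    A node covers the range if its lowvalue is smaller than smallest_key and
    its highvalue is larger than largest_key. Returns None if no node does.
    """
    node = start
    while node is not None:
        if node.lowvalue < smallest_key and largest_key < node.highvalue:
            return node
        node = node.right
    return None


if __name__ == "__main__":
    left = Node(float("-inf"), 10)
    mid = Node(10, 20)
    right = Node(20, float("inf"))
    left.right, mid.right = mid, right
    found = find_covering_node(left, 12, 15)
    print(found.lowvalue, found.highvalue)
```

Starting from a node that was on the saved search path, the traverse moves right only; this is sufficient here because the sketch assumes, as in the text, that concurrent splits move keys to the right of the node popped from the stack.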

The pure searches and the search phases of single-key actions and the search 
phase of the bulk insertion apply latch coupling in the appropriate mode, and 
all changes in nodes are made under an exclusive latch, which prevents all other 
possible path losses. 

We have: 

Theorem 7. The concurrent algorithm BI and standard concurrent searches
and concurrent single-key actions all applied to the same level-linked (a, b)-tree
run correctly with each other.

4 Conclusion 

We have presented an I/O-optimal bulk insertion algorithm for (a, b)-trees, a
general class of search trees that includes B-trees. Some ideas of the new algorithm
stem from earlier papers on bulk updates [11, 15]. The new aspect in the
present paper is that we couple efficient concurrency control with an I/O-optimal
algorithm, and the I/O-complexity is carefully analyzed in both the worst case and
the amortized sense.

The same amortized time bound has been proved for relaxed (a, b)-trees in
[10], but with linear worst case time. In addition, although [10] gives operations 
to locally decrease imbalance in certain nodes, it does not give any deterministic 
algorithm to rebalance the whole tree after group insertion. Algorithms based on 
relaxed balancing also have the problem that they introduce new almost empty 
nodes at intermediate stages. 

The idea of performing bulk insertion by inserting small trees [14] is indepen- 
dently presented for R-trees in [4]. In [4] concurrency control is not discussed, 
whereas our main contribution is to introduce efficient concurrency control into 
I/O-optimal bulk insertion. 

Our method of bulk rebalancing can also be applied for buffer trees [1,2]. 
Buffer trees are a good choice for efficient bulk insertion in the case in which the 
bulk is large and does not fit into the main memory. 
Abstract. The objective of this paper is to present an extension to the set-based
model (SBM), which is an effective technique for computing term weights based
on co-occurrence patterns, for processing conjunctive and phrase queries. The
intuition that semantically related term occurrences often occur closer to each other
is taken into consideration. The novelty is that all known approaches that account
for co-occurrence patterns were initially designed for processing disjunctive
(OR) queries, whereas our extension provides a simple, effective and efficient way to
process conjunctive (AND) and phrase queries. This technique is time efficient
and yet yields clear improvements in retrieval effectiveness. Experimental results
show that our extension improves the average precision of the answer set for all
collections evaluated, keeping computational cost small. For the TREC-8 collection,
our extension led to a gain, relative to the standard vector space model,
of 23.32% and 18.98% in average precision curves for conjunctive and phrase
queries, respectively.



1 Introduction 

Users of the World Wide Web are not only confronted by an immense overabundance
of information, but also by a plethora of tools for searching for the web pages that
suit their information needs. Web search engines differ widely in interface, features,
coverage of the web, ranking methods, delivery of advertising, and more. Different
search engines and portals have different (default) semantics of handling a multi-word
query. However, all major search engines, such as Altavista, Google, Yahoo, and Teoma,
use the AND semantics, i.e., conjunctive queries (all the query words must appear
in a document for it to be considered).

In this paper we propose an extension to the set-based model [1, 2] to process
conjunctive and phrase queries. The set-based model uses a term-weighting scheme based
on association rules theory [3]. Association rules are interesting because they provide
all the elements of the tf × idf scheme in an algorithmically efficient and parameterized
approach. Also, they naturally provide for quantification of representative term
co-occurrence patterns, something that is not present in the tf × idf scheme.

* This work was supported in part by the GERINDO project-grant MCT/CNPq/CT-INFO 
552.087/02-5 and by CNPq grant 520.916/94-8 (Nivio Ziviani). 
We evaluated and validated our extension of the set-based model (SBM) for processing
conjunctive and phrase queries through experimentation using two reference
collections. Our evaluation is based on a comparison to the standard vector space model
(VSM) adapted to handle these types of queries. Our experimental results show that
the SBM yields higher retrieval performance for all query types
and collections considered. For the TREC-8 collection [4], which is 2 gigabytes in
size and contains approximately 530,000 documents, the SBM yields average precision
figures that are, respectively, 23.32% and 18.98% higher than those of the VSM for conjunctive and
phrase queries. The set-based model is also competitive in terms of computing performance.
For the WBR99 collection, which is 16 gigabytes in size and contains approximately
6,000,000 web pages, using a log with 100,000 queries, the increase in the average
response time is, respectively, 6.21% and 14.67% for conjunctive and phrase queries.

The paper is organized as follows. The next section describes the representation
of co-occurrence patterns based on a variant of association rules. A review of the
set-based model is presented in Section 3. Section 4 presents our extension of the
set-based model for processing conjunctive and phrase queries. In Section 5 we describe the
reference collections and the experimental results comparing the VSM and the SBM
for processing conjunctive and phrase queries. Related work is discussed in Section 6.
Finally, we present some conclusions and future work in Section 7.

2 Preliminaries 

In this section we introduce the concept of termsets as a basis for computing term
weights. In the set-based model a document is described by a set of termsets, where
a termset is simply an ordered set of terms extracted from the document itself.

Let $T = \{k_1, k_2, \ldots, k_t\}$ be the vocabulary of a collection of documents $D$, that
is, the set of $t$ unique terms that may appear in a document from $D$. There is a total
ordering among the vocabulary terms, which is based on the lexicographical order of
terms, so that $k_i < k_{i+1}$, for $1 \le i \le t - 1$.

Definition 1. An n-termset $s$ is an ordered set of $n$ unique terms, such that $s \subseteq T$.
Notice that the order among terms in $s$ follows the aforementioned total ordering.

Let $S = \{s_1, s_2, \ldots, s_{2^t}\}$ be the vocabulary-set of a collection of documents $D$, that
is, the set of $2^t$ unique termsets that may appear in a document from $D$. Each document
$j$ from $D$ is characterized by a vector in the space of the termsets. With each termset $s_i$,
$1 \le i \le 2^t$, we associate an inverted list $l_{s_i}$ composed of identifiers of the documents
containing that termset. We also define the frequency $ds_i$ of a termset $s_i$ as the number
of occurrences of $s_i$ in $D$, that is, the number of documents where $s_i \subseteq d_j$ and $d_j \in D$,
$1 \le j \le N$. The frequency $ds_i$ of a termset $s_i$ is the length of its associated inverted
list ($|l_{s_i}|$).
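
As an illustration of these definitions, the inverted list of a termset can be obtained by intersecting the inverted lists of its terms. The sketch below assumes a toy in-memory index that maps each term to the set of identifiers of the documents containing it; the function name and the index layout are hypothetical, and proximity information is ignored.

```python
def termset_inverted_list(termset, term_index):
    """Return the sorted identifiers of the documents that contain every term of the termset.

    term_index maps a term to the set of identifiers of documents containing it.
    """
    lists = [term_index.get(term, set()) for term in termset]
    if not lists:
        return []
    return sorted(set.intersection(*lists))


if __name__ == "__main__":
    index = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}}
    inverted = termset_inverted_list(("a", "b"), index)
    print(inverted, "frequency:", len(inverted))
```

The frequency of the termset is then simply the length of the returned list.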

Definition 2. A termset $s_i$ is a frequent termset if its frequency $ds_i$ is greater than
or equal to a given threshold, which is known as support in the scope of association
rules [3], and referred to as minimal frequency in this work. As presented in the original
Apriori algorithm [5], an n-termset can be frequent only if all of its (n-1)-termsets
are also frequent.

The proximity information is used as a pruning strategy to find only the termset
occurrences bounded by a specified proximity threshold (referred to as minimal proximity),
conforming with the assumption that semantically related term occurrences often
occur closer to each other.

Definition 3. A closed termset $cs_i$ is a frequent termset that is the largest termset
among the termsets that are subsets of $cs_i$ and occur in the same set of documents.
That is, given a set $\mathcal{D} \subseteq D$ of documents and the set $S_{\mathcal{D}} \subseteq S$ of termsets that occur in
all documents from $\mathcal{D}$ and only in these, a closed termset $cs_i$ satisfies the property that
$\nexists s_j \in S_{\mathcal{D}} \mid cs_i \subset s_j$.

For the sake of processing disjunctive queries (OR), closed termsets are interesting because
they represent a reduction in the computational complexity and in the amount of
data that has to be analyzed, without losing information, since all frequent termsets in
a closure are represented by the respective closed termset [1,2].

Definition 4. A maximal termset $ms_i$ is a frequent termset that is not a subset of any
other frequent termset. That is, given the set $S_{\mathcal{D}} \subseteq S$ of frequent termsets that occur
in all documents from $\mathcal{D}$, a maximal termset $ms_i$ satisfies the property that
$\nexists s_j \in S_{\mathcal{D}} \mid ms_i \subset s_j$.

Let $FT$ be the set of all frequent termsets, $CFT$ be the set of all closed termsets,
and $MFT$ be the set of all maximal termsets. It is straightforward to see that the following
relationship holds: $MFT \subseteq CFT \subseteq FT$. The set $MFT$ is orders of magnitude
smaller than the set $CFT$, which itself is orders of magnitude smaller than the set $FT$.
It is proven that the set of maximal termsets associated with a document collection is
the minimum amount of information necessary to derive all frequent termsets associated
with that collection [6].

Generating maximal termsets is a problem very similar to mining association rules,
and the algorithms employed for the latter are our starting point [3]. Our approach is
based on an efficient algorithm for association rule mining, called GenMax [7], which
has been adapted to handle terms and documents instead of items and transactions,
respectively. GenMax uses backtracking search to enumerate all $MFT$.
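
The relationship between frequent, closed and maximal termsets can be checked on a toy collection. The sketch below is a brute-force enumeration over all subsets of the vocabulary, so it is exponential and meant only to illustrate the definitions; it is not the mining algorithm used by the set-based model. Documents are assumed to be plain Python sets of terms, and proximity is ignored.

```python
from itertools import combinations


def enumerate_termsets(docs, min_freq):
    """Return (frequent, closed, maximal) termsets of a list of documents (sets of terms)."""
    vocab = sorted(set().union(*docs))
    frequent = {}  # termset -> set of indices of the documents containing it
    for n in range(1, len(vocab) + 1):
        for combo in combinations(vocab, n):
            s = frozenset(combo)
            occurs_in = frozenset(i for i, d in enumerate(docs) if s <= d)
            if len(occurs_in) >= min_freq:
                frequent[s] = occurs_in
    # Closed: no proper superset occurs in exactly the same documents.
    closed = {s for s, occ in frequent.items()
              if not any(s < t and occ == frequent[t] for t in frequent)}
    # Maximal: no proper superset is frequent at all.
    maximal = {s for s in frequent if not any(s < t for t in frequent)}
    return set(frequent), closed, maximal


if __name__ == "__main__":
    collection = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
    ft, cft, mft = enumerate_termsets(collection, min_freq=2)
    print(len(ft), len(cft), len(mft))
```

On this collection with a minimal frequency of 2, the termset {c} is frequent but not closed, because {a, c} occurs in exactly the same documents, and only {a, b} and {a, c} are maximal.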

3 Review of the Set-Based Model (SBM) 

In the set-based model, a document is described by a set of termsets, extracted from the
document itself. With each termset we associate a pair of weights representing (a) its
importance in each document and (b) its importance in the whole document collection.
In a similar way, a query is described by a set of termsets with a weight representing
the importance of each termset in the query. The algebraic representation of the set
of termsets for both documents and queries corresponds to vectors in a $2^t$-dimensional
Euclidean space, where $t$ is equal to the number of unique index terms in the document
collection.

3.1 Termset Weights 

Term weights can be calculated in many different ways [8, 9]. The best known term
weighting schemes use weights that are a function of (i) $tf_{i,j}$, the number of times that an
index term $i$ occurs in a document $j$, and (ii) $df_i$, the number of documents in which an index
term $i$ occurs in the whole document collection. Such term-weighting strategies are called
tf × idf schemes. The expression for $idf_i$ represents the importance of term $i$ in the
collection: it assigns a high weight to terms which are encountered in a small number of
documents in the collection, supposing that rare terms have high discriminating value.

In the set-based model, the association rules scheme naturally provides for quantification
of representative patterns of term co-occurrences, something that is not present
in the tf × idf approach. To determine the weights associated with the termsets, we
also use the number of occurrences of a termset in a document, in a query, and in the
whole collection. Formally, the weight of a termset $i$ in a document $j$ is defined as:

$$w_{i,j} = (1 + \log sf_{i,j}) \times ids_i = (1 + \log sf_{i,j}) \times \log\left(1 + \frac{N}{ds_i}\right) \qquad (1)$$

where $N$ is the number of documents in the collection, $sf_{i,j}$ is the number of occurrences
of the termset $s_i$ in document $d_j$, and $ids_i$ is the inverted frequency of occurrence
of the termset $s_i$ in the collection. $sf_{i,j}$ generalizes $tf_{i,j}$ in the sense that it counts the
number of times that the termset $s_i$ appears in document $d_j$. The component $ids_i$ also
carries the same semantics of $idf_i$, but accounting for the cardinality of the termsets
as follows. High-order termsets usually have low frequency, resulting in large inverted
frequencies. Thus, this strategy assigns large weights to termsets that appear in a small
number of documents, that is, rare termsets result in greater weights.
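
A small numerical sketch of the termset weight follows. It assumes the form $w_{i,j} = (1 + \log sf_{i,j}) \times \log(1 + N / ds_i)$ with natural logarithms; the logarithm base is not stated in the text, and the function name and the zero weight for absent termsets are illustrative choices.

```python
import math


def termset_weight(sf_ij, ds_i, n_docs):
    """Weight of a termset in a document, assuming natural logarithms.

    sf_ij:  occurrences of the termset in the document
    ds_i:   number of documents containing the termset
    n_docs: number of documents in the collection (N)
    """
    if sf_ij <= 0 or ds_i <= 0:
        return 0.0  # assumption: a termset that does not occur gets weight zero
    return (1.0 + math.log(sf_ij)) * math.log(1.0 + n_docs / ds_i)


if __name__ == "__main__":
    rare = termset_weight(sf_ij=2, ds_i=1, n_docs=100)
    common = termset_weight(sf_ij=2, ds_i=50, n_docs=100)
    print(rare, common)
```

For the same in-document count, the termset that occurs in a single document receives a larger weight than the one that occurs in half of the collection, which matches the behavior described above.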



3.2 Similarity Calculation 



Since documents and queries are represented as vectors, we assign a similarity measure
to every document containing any of the query termsets, defined as the normalized
scalar product between the set of document vectors $d_j$, $1 \le j \le N$, and the query
vector $q$. This approach is equivalent to the cosine of the angle between these two
vectors. The similarity between a document $d_j$ and a query $q$ is defined as:



$$sim(q, d_j) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{s \in S_q} w_{s,j} \times w_{s,q}}{|d_j| \times |q|} \qquad (2)$$



where $w_{s,j}$ is the weight associated with the termset $s$ in document $d_j$, $w_{s,q}$ is the
weight associated with the termset $s$ in query $q$, and $S_q$ is the set of all termsets $s$ such that
$s \subseteq q$. We observe that the normalization (i.e., the factors in the denominator) was not
expanded, as usual. The normalization is done using only the 1-termsets that compose
the query and document vectors. This is important to reduce computational costs because
computing the norm of a document using termsets might be prohibitively costly.



3.3 Searching Algorithm 

The steps performed by the set-based model for the calculation of the similarity metrics
are equivalent to those of the standard vector space model. Figure 1 presents the searching
algorithm. First we create the data structures (line 4) that are used for calculating the
document similarities $A$ among all termsets $S_q$ of a document $d_j$. Then, for each query
term, we retrieve its inverted list, and determine the first frequent termsets, i.e., the frequent
termsets of size equal to 1, applying the minimal frequency threshold mf (lines
5 to 10). The next step is the enumeration of all termsets based on the 1-termsets, filtered
by the minimal frequency and proximity thresholds (line 11). After enumerating
all termsets, we evaluate their inverted lists, calculating the partial similarity of a termset
$s_i \in S_q$ to a document $d_j$ (lines 12 to 17). After evaluating the termsets, we normalize the
document similarities $A$ by dividing each document similarity $A_j \in A$ by the norm of
the document $d_j$ (line 18). The final step is to select the k largest similarities and return
the corresponding documents (line 19).



SBM(Q, mf, mp, k)
    Q  : a set of query terms
    mf : minimum frequency threshold
    mp : minimum proximity threshold
    k  : number of documents to be returned

 1. Let A be a set of accumulators
 2. Let Cq be a set of 1-termsets
 3. Let Sq be a set of termsets
 4. A = ∅, Sq = ∅
 5. for each query term t ∈ Q do begin
 6.     if df_t ≥ mf then begin
 7.         Obtain the 1-termset s_t from term t
 8.         Cq = Cq ∪ {s_t}
 9.     end
10. end
11. Sq = Termsets_Gen(Cq, mf, mp)
12. for each termset s_i ∈ Sq do begin
13.     for each [d_j, sf_{i,j}] in l_{s_i} do begin
14.         if A_j ∉ A then A = A ∪ {A_j}
15.         A_j = A_j + w_{s_i,j} × w_{s_i,q}, from Eq. (1)
16.     end
17. end
18. for each accumulator A_j ∈ A do A_j = A_j ÷ |d_j|
19. determine the k largest A_j ∈ A and return the corresponding documents
20. end

Fig. 1. The set-based model searching algorithm
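
The steps of Fig. 1 can be sketched in Python as follows. This is an illustration under several assumptions that are not taken from the paper: the index is a dictionary mapping each term to a dictionary of document identifiers and term frequencies; termsets are enumerated by brute force over all subsets of the frequent query terms rather than by a dedicated mining algorithm; the proximity threshold is ignored; the in-document count of a termset is approximated by the smallest count of its terms; each termset is assumed to occur once in the query; weights follow the form $(1 + \log sf) \times \log(1 + N / ds)$ with natural logarithms; and document norms are precomputed and passed in.

```python
import heapq
import math
from itertools import combinations


def sbm_search(query_terms, index, doc_norms, n_docs, min_freq, k):
    """Rank documents for a query; index maps term -> {doc_id: term frequency}."""
    # Lines 5-10 of Fig. 1: keep the query terms that form frequent 1-termsets.
    frequent_terms = sorted(t for t in set(query_terms)
                            if len(index.get(t, {})) >= min_freq)
    accumulators = {}
    # Line 11: enumerate termsets (brute force here, proximity ignored).
    for n in range(1, len(frequent_terms) + 1):
        for termset in combinations(frequent_terms, n):
            docs = set.intersection(*(set(index[t]) for t in termset))
            if len(docs) < min_freq:
                continue
            ids = math.log(1.0 + n_docs / len(docs))
            w_query = ids  # assumption: the termset occurs once in the query
            # Lines 12-17: accumulate the partial similarities.
            for d in docs:
                sf = min(index[t][d] for t in termset)  # assumed co-occurrence count
                w_doc = (1.0 + math.log(sf)) * ids
                accumulators[d] = accumulators.get(d, 0.0) + w_doc * w_query
    # Line 18: normalize by the document norm.
    for d in accumulators:
        accumulators[d] /= doc_norms[d]
    # Line 19: return the identifiers of the k highest-scoring documents.
    return heapq.nlargest(k, accumulators, key=accumulators.get)


if __name__ == "__main__":
    toy_index = {"a": {1: 2, 2: 1}, "b": {1: 1, 3: 1}}
    toy_norms = {1: 1.0, 2: 1.0, 3: 1.0}
    print(sbm_search(["a", "b"], toy_index, toy_norms, n_docs=3, min_freq=1, k=2))
```

In the toy example, document 1 is ranked first because it is the only one that matches both 1-termsets and the 2-termset {a, b}.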



3.4 Computational Complexity 

The complexity of the standard vector space model and the set-based model is linear
with respect to the number of documents in the collection. Formally, the upper bound
on the number of operations performed for satisfying a query in the vector space model
is $O(|q| \times N)$, where $|q|$ is the number of terms in the query and $N$ is the number
of documents in the collection. The worst case scenario for the vector space model is
a query comprising the whole vocabulary $T$ ($|q| = t$), which results in a merge of all
inverted lists in the collection. The computational complexity for the set-based model is
$O(cN)$, where $c$ is the number of termsets, a value that is $O(2^{|q|})$, where $|q|$ is the
number of terms in the query. These are worst case measures and the average measures
for the constants involved are much smaller [1,2].



4 Modeling Conjunctive and Phrase Queries in SBM 

In this section we show how to model conjunctive and phrase queries using the frame- 
work provided by the original set-based model (see Section 3). Our approach does not 
modify its algebraic representation, and the changes to the original model are minimal. 

4.1 Conjunctive Queries 

The main modification of the set-based model for conjunctive and phrase query processing
is related to the enumeration of termsets. In the original version of the set-based
model, the enumeration algorithm determines all closed termsets for a given user query
and the given minimal frequency and proximity thresholds. Each mined closed termset represents
a valid co-occurrence pattern in the space of documents defined by the terms of
the query. For disjunctive queries, each one of these patterns contributes to the similarity
between a document and a query. Conjunctive query processing requires that
only the co-occurrence pattern defined by the whole query be considered, i.e., all
query terms must occur in a given document. If so, this document can be added
to the response set.

A maximal termset corresponds to a frequent termset that is not a subset of any other
frequent termset (see Definition 4). Based on this definition, we can extend the original
set-based model by enumerating the set of maximal termsets for a given user query instead
of the set of closed termsets. To verify that the enumerated set is valid, we check the
following conditions. First, the mined set of maximal termsets must consist of a
single element. Second, this element must contain all query terms. If both conditions
hold, we can evaluate the inverted list of the maximal termset found, calculating its
partial similarity to each document $d_j$ (lines 12 to 17 of the algorithm of Figure 1).
The final steps are the normalization of document similarities and the selection of the k
largest similarities, returning the corresponding documents.

The proximity information is used as a pruning strategy to find only the maximal
termset occurrences bounded by the minimal proximity threshold, conforming with the
assumption that semantically related term occurrences often occur closer to each other.
This pruning strategy is incorporated in the maximal termsets enumeration algorithm.
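
The two validity conditions for conjunctive queries can be expressed as a short check. The sketch assumes that the maximal termsets have already been mined from the query terms and are passed in as an iterable of sets; the function name is hypothetical.

```python
def conjunctive_termset(query_terms, maximal_termsets):
    """Return the single maximal termset if it contains all query terms, else None."""
    query = frozenset(query_terms)
    candidates = [frozenset(m) for m in maximal_termsets]
    # Condition 1: exactly one maximal termset was mined.
    # Condition 2: that termset contains every query term.
    if len(candidates) == 1 and query <= candidates[0]:
        return candidates[0]
    return None


if __name__ == "__main__":
    print(conjunctive_termset(["a", "b"], [{"a", "b"}]))
    print(conjunctive_termset(["a", "b"], [{"a"}, {"b"}]))
```

When the check fails, for instance because the query terms never co-occur often enough to form one frequent termset, no document satisfies the conjunctive query under the chosen thresholds.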



4.2 Phrase Queries 

Search engines are used to find data in response to ad hoc queries. However, a significant
fraction of the queries include phrases, where the user has indicated that some of the
query terms must be adjacent, typically by enclosing them in quotation marks. Phrases
have the advantage of being unambiguous concept markers and are therefore viewed as
a valuable addition to ranked queries.

A standard way to evaluate phrase queries is to use an inverted index, in which
for each index term there is a list of postings, and each posting includes a document
identifier, an in-document frequency, and a list of ordinal word positions at which the term
occurs in the document. Given such a word-level inverted index and a phrase query,
it is straightforward to combine the postings lists for the query to identify matching
documents and to rank them using the standard vector space model.

The original set-based model can be easily adapted to handle phrase queries. To
achieve this, we enumerate the set of maximal termsets instead of the set of closed
termsets, using the same restrictions applied for conjunctive queries. To verify that the
query terms are adjacent, we just check whether their ordinal word positions are adjacent. The
proximity threshold is then set to one and is used to evaluate this referential constraint.
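
The adjacency check on ordinal word positions can be sketched as follows. The sketch assumes a word-level index that maps each term to a dictionary from document identifier to a list of word positions; this layout and the function name are illustrative, and ranking is left out.

```python
def phrase_documents(phrase, positional_index):
    """Return the sorted identifiers of documents in which the phrase terms occur adjacently.

    phrase: list of terms in phrase order.
    positional_index: term -> {doc_id: list of ordinal word positions}.
    """
    if not phrase or any(t not in positional_index for t in phrase):
        return []
    # Only documents containing every phrase term can match.
    common = set.intersection(*(set(positional_index[t]) for t in phrase))
    matches = []
    for doc in sorted(common):
        later = [set(positional_index[t][doc]) for t in phrase[1:]]
        for start in positional_index[phrase[0]][doc]:
            # The term at offset k in the phrase must occur at position start + k.
            if all(start + off in pos for off, pos in enumerate(later, start=1)):
                matches.append(doc)
                break
    return matches


if __name__ == "__main__":
    toy = {"new": {1: [0, 5], 2: [3]}, "york": {1: [1], 2: [7]}}
    print(phrase_documents(["new", "york"], toy))
```

In the toy index both documents contain both terms, but only in document 1 does "york" occur at the position immediately after "new".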

We may expect that this extension to the set-based model is suitable for selecting
just maximal termsets representing strong correlations, increasing the retrieval effectiveness
of both types of queries. Our experimental results (see Section 5.3) confirm
such observations.

5 Experimental Evaluation 

In this section we describe experimental results for the evaluation of the set-based model
(SBM) for conjunctive and phrase queries in terms of both effectiveness and computational
efficiency. Our evaluation is based on a comparison to the standard vector space
model (VSM). We first present the experimental setup and the reference collections
employed, and then discuss the retrieval performance and the computational efficiency.

5.1 Experimental Setup 

In this evaluation we use two reference collections that comprise not only the documents,
but also a set of example queries and the relevant responses for each query, as
selected by experts. We quantify the retrieval effectiveness of the various approaches
through standard measures of average recall and precision. The computational efficiency
is evaluated through the query response time, that is, the processing time to
select and rank the documents for each query.
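
As an illustration of a precision-based effectiveness measure, the sketch below computes the non-interpolated average precision of one ranked answer list. This is one common definition and is an assumption here; the text does not state which exact variant of average precision was used in the experiments.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision of a ranked list against a set of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    return total / len(relevant)


if __name__ == "__main__":
    print(average_precision([1, 2, 3, 4], {1, 3}))
```

Relevant documents that are never retrieved contribute zero to the sum, so the measure rewards both finding relevant documents and ranking them early.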

The experiments were performed on a Linux-based PC with an AMD Athlon 2600+
2.0 GHz processor and 512 MBytes of RAM. Next we present the reference collections
used, followed by the results obtained.

5.2 The Reference Collections 

In our evaluation we use two reference collections, WBR99 and TREC-8 [4]. Table 1
presents the main features of these collections.

The WBR99 reference collection is composed of a database of Web pages, a set of
example Web queries, and a set of relevant documents associated with each example
query. The database is composed of 5,939,061 pages of the Brazilian Web, under the
domain “.br”. We decided to use this collection because it represents a highly connected
subset of the Web that is large enough to provide a good prediction of the results if the
whole Web were used and, at the same time, is small enough to be handled by our
available computational resources.

Table 1. Characteristics of the reference collections

Characteristics                 TREC-8          WBR99
Number of Documents             528,155         5,939,061
Number of Distinct Terms        737,833         2,669,965
Number of Topics                450             100,000
Number of Used Topics           50 (401-450)    50
Average Terms per Query         4.38            1.94
Average Relevants per Query     94.56           35.40
Size (GB)                       2               16

A total of 50 example queries were selected from a log of 100,000 queries submitted
to the TodoBR search engine*. The queries selected were the 50 most frequent ones with
more than two terms. Some frequent queries related to sex were not considered. The
mean number of keywords per query is 3.78. The sets of relevant documents for each
query were built using the pooling method used for the Web-based TREC collection [10].
We composed a query pool formed by the top 15 documents generated by each evaluated
model for both query types. Each query pool contained an average of 19.91 documents
for conjunctive queries and 18.77 for phrase queries. All documents present in each
query pool were submitted to a manual evaluation by a group of 9 users, all of them
familiar with Web searching. The average number of relevant documents per query is
12.38 and 6.96 for conjunctive and phrase queries, respectively.

The TReC collection [4] has been growing steadily over the years. At TREC-8,
which is used in our experiments, the collection size was roughly 2 gigabytes. The
documents present in the TReC-8 collection are tagged with SGML to allow easy
parsing, and come from the following sources: the Financial Times, Federal Register,
Congressional Record, Foreign Broadcast Information Service and LA Times.

The TReC collection includes a set of example information requests (queries) which
can be used for testing a new ranking algorithm. Each request is a description of an
information need in natural language. The TReC-8 collection has a total of 450 queries,
usually referred to as topics. Our experiments are performed with the 401-450 range of
topics. This range of topics has 4.38 index terms per query on average.

5.3 Retrieval Performance 

We start our evaluation by verifying the precision-recall curves for each model when
applied to the reference collections. Each curve quantifies the precision as a function
of the percentage of relevant documents retrieved (recall). The results presented for
SBM for both query types were obtained by setting the minimal frequency threshold to one
document. The minimal proximity threshold was not used for the conjunctive queries
evaluation, and was set to one for the phrase queries.

* http://www.todobr.com.br


As we can see in Table 2, SBM yields better precision than VSM, regardless of the
recall level. Further, the gains increase with the size of the queries, because larger
queries allow computing a more representative termset, and are consistent for both
query types. Furthermore, accounting for correlations among terms never degrades the
quality of the response sets. We confirm these observations by verifying the overall
average precision achieved by each model. The gains provided by SBM over VSM
in terms of overall average precision were 7.82% and 23.32% for conjunctive queries and
9.73% and 18.98% for phrase queries, for the WBR99 and TReC-8 collections, respectively.
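
The arithmetic behind these figures can be checked with a short sketch. It assumes that the "Average" row of Table 2 is the mean of the eleven standard recall points and that the "Improvement" row is the relative gain of SBM over VSM; the function and variable names are illustrative and not taken from the paper. Recomputing from the rounded averages lands within about 0.1 percentage points of the reported gains, which suggests the authors worked from unrounded values.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# 11-point precision values (%) of VSM for conjunctive queries on WBR99 (Table 2a).
vsm_wbr99 = [44.14, 43.03, 42.23, 38.40, 37.41, 37.41,
             36.71, 33.97, 29.62, 27.27, 25.00]
avg_vsm = sum(vsm_wbr99) / len(vsm_wbr99)

# Average precision (%) as listed in Table 2: (VSM, SBM) for each setting.
reported_averages = {
    ("conjunctive", "WBR99"): (35.92, 38.73),
    ("conjunctive", "TReC-8"): (21.35, 26.33),
    ("phrase", "WBR99"): (16.16, 17.73),
    ("phrase", "TReC-8"): (12.39, 14.75),
}

gains = {key: relative_improvement(vsm, sbm)
         for key, (vsm, sbm) in reported_averages.items()}

for key, gain in gains.items():
    print(key, round(gain, 2))
```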



Table 2. Recall-precision curves for VSM and SBM

(a) Conjunctive queries results

                        Precision (%)
                    WBR99           TReC-8
Recall (%)         VSM     SBM     VSM     SBM
0                44.14   48.81   63.17   74.41
10               43.03   48.81   44.06   53.63
20               42.23   46.13   33.87   38.56
30               38.40   41.93   26.36   31.81
40               37.41   39.66   20.11   25.69
50               37.41   39.32   15.35   21.24
60               36.71   37.74   10.22   16.66
70               33.97   36.53    7.63   10.88
80               29.62   32.55    6.48    7.24
90               27.27   28.25    3.90    5.13
100              25.00   26.29    3.67    4.32
Average          35.92   38.73   21.35   26.33
Improvement          -    7.82       -   23.32

(b) Phrase queries results

                        Precision (%)
                    WBR99           TReC-8
Recall (%)         VSM     SBM     VSM     SBM
0                48.71   51.38   42.41   45.15
10               41.48   43.58   20.21   28.61
20               27.83   31.91   16.07   21.71
30               21.13   23.35   12.47   15.70
40               17.59   18.67   11.19   13.92
50               10.54   11.71    9.25   10.47
60                4.18    5.22    7.64    8.85
70                2.94    3.49    6.62    6.97
80                2.09    2.96    3.92    4.00
90                1.03    1.57    3.30    3.44
100               0.23    1.23    3.24    3.37
Average          16.16   17.73   12.39   14.75
Improvement          -    9.73       -   18.98

In summary, the set-based model (SBM) is the first information retrieval model that
exploits term correlations and term proximity effectively and provides significant gains
in terms of precision, regardless of the query type. In the next section we discuss the
computational costs associated with SBM.



5.4 Computational Efficiency 

In this section we compare our model to the standard vector space model regarding 
the query response time, in order to evaluate its feasibility in terms of computational 
costs. This is important because one major limitation of existing models that account 
for term correlations is their computational cost. Several of these models cannot be 
applied to large or even mid-size collections since their costs increase exponentially 
with the vocabulary size. 
We compared the response times for the models and collections considered, which
are summarized in Table 3. We also calculated the increase in the response time of
SBM when compared to VSM for both query types. All 100,000 queries submitted to
the TodoBR search engine, excluding the single-term queries, were evaluated for the
WBR99 collection. We observe that SBM results in a response time increase of 6.21%
and 24.13% for conjunctive queries and 14.67% and 23.68% for phrase queries when
compared to VSM for WBR99 and TReC-8, respectively.



Table 3. Average response time for VSM and SBM

                   Avg. Response Time (s)
                    WBR99           TReC-8
Query Type         VSM     SBM     VSM     SBM
conjunctive      0.632   0.671   0.029   0.036
phrase           1.491   1.709   0.038   0.047

We identify two main reasons for the relatively small increase in execution time for
SBM. First, determining maximal termsets and calculating their similarity do not
significantly increase the cost associated with queries. This is due to the small number
of query-related termsets in the reference collections, especially for WBR99. As
a consequence, the associated inverted lists tend to be small and are usually manipulated
in main memory in our implementation of SBM. Second, we employ pruning
techniques that discard irrelevant termsets early in the computation, as described in [1].



6 Related Work 

The vector space model was proposed by Salton [11, 12], and different weighting
schemes were presented [8, 9]. In the vector space model, index terms are assumed
to be mutually independent. The independence assumption leads to a linear weighting
function which, although not necessarily realistic, is easy to compute.

Different approaches to account for co-occurrence among index terms during the 
information retrieval process have been proposed [13, 14, 15]. The work in [16] presents 
an interesting approach to compute index term correlations based on automatic indexing 
schemes, defining a new information retrieval model called generalized vector space 
model. Wong et al. [17] extended the generalized vector space model to handle queries 
specified as boolean expressions. 

The set-based model was the first information retrieval model that exploits term 
correlations and term proximity effectively and provides significant gains in terms of 
precision, regardless of the size of the collection and of the size of the vocabulary [1, 
2]. Experimental results showed significant and consistent improvements in average 
precision curves in comparison to the vector space model and to the generalized vector 
space model, keeping computational cost small. 

Our work differs from the work discussed above in the following way. All known
approaches that account for co-occurrence patterns were initially designed for processing
disjunctive (OR) queries. Our extension to the set-based model provides a simple,
effective and efficient way to process conjunctive (AND) and phrase queries. Experimental
results show significant and consistent improvements in average precision curves
in comparison to the standard vector space model adapted to process these types of
queries, keeping processing times close to those of the vector space model.

The work in [18] introduces a theoretical framework for association rule mining
based on a boolean model of information retrieval known as the Boolean Retrieval Model.
Our work differs from it in the following way. We use association rules to define a new
information retrieval model that not only provides the main term weights and
assumptions of the tf × idf scheme, but also provides for quantification of term
co-occurrence and can be successfully used in the processing of disjunctive,
conjunctive and phrase queries.

Ahonen-Myka [19] employs a versatile technique, based on maximal frequent
sequences, for finding complex text phrases from full text for further processing and
knowledge discovery. Our work also uses the concept of maximal termsets/sequences
to account for term co-occurrence patterns in documents, but the termsets are
successfully used as a basis for an information retrieval model.

7 Conclusions and Future Work 

We presented an extension for the set-based model to consider correlations among in- 
dex terms in conjunctive and phrase queries. We show that it is possible to signifi- 
cantly improve retrieval effectiveness, while keeping extra computational costs small. 
The computation of correlations among index terms using maximal termsets enumer- 
ated by an algorithm to generate association rules leads to a direct extension of the 
set-based model. Our approach does not modify its algebraic representation, and the 
changes to the original model are minimal. 

We evaluated and validated our proposed extension for the set-based model for con- 
junctive and phrase queries in terms of both effectiveness and computational efficiency 
using two test collections. We show through curves of recall versus precision that our 
extension presents results that are superior for all query types considered and the addi- 
tional computational costs are acceptable. In addition to assessing document relevance, 
we also showed that the proximity information has application in identifying phrases 
with a greater degree of precision. 

Web search engines are rapidly emerging as the most important application of
the World Wide Web, and query segmentation is one of the most promising techniques
to improve search precision. This technique reduces the query to a form that is more
likely to express the topic(s) asked for, in a manner suitable for a word-based
or phrase-based inverse lookup, and thus improves the precision of the search. We
will use the set-based model for automatic query segmentation.

Metric Indexing for the Vector Model in Text Retrieval

Abstract. In the area of Text Retrieval, processing a query in the vector
model has been verified to be qualitatively more effective than searching
in the boolean model. However, in the case of the classic vector model the
current methods of processing many-term queries are inefficient, and in the
case of the LSI model there is no efficient method for processing
even few-term queries. In this paper we propose a method of vector
query processing based on metric indexing, which is efficient especially
for the LSI model. In addition, we propose a concept of approximate
semi-metric search, which can further improve the efficiency of the retrieval
process. Results of experiments made on a moderate-size text collection are
included.



1 Introduction 

The Text Retrieval (TR) models [4, 3] provide a formal framework for retrieval
methods aimed at searching huge collections of text documents. The classic vector
model as well as its algebraic extension LSI have been shown to be more effective
(according to precision/recall measures) than the other existing models¹.
However, current methods of vector query processing are not very efficient for
many-term queries, while in the LSI model they are not efficient at all. In this
paper we propose a method of vector query processing based on metric indexing,
which is highly efficient especially for searching in the LSI model.

¹ For a comparison of various TR models we refer to [20, 11].



1.1 Classic Vector Model

In the classic vector model, each document $D_j$ in a collection $C$ ($0 < j \le m$,
$m = |C|$) is characterized by a single vector $d_j$, where each coordinate of $d_j$ is
associated with a term $t_i$ from the set of all unique terms in $C$ ($0 < i \le n$, where
$n$ is the number of terms). The value of a vector coordinate is a real number
$w_{ij} \ge 0$ representing the weight of the $i$-th term in the $j$-th document. Hence,
a collection of documents can be represented by an $n \times m$ term-by-document
matrix $A$. There are many ways to compute the term weights $w_{ij}$ stored
in $A$. A popular weight construction is computed as tf * idf (see e.g. [4]).



Queries. The most important problem in the vector model is the querying
mechanism that searches matrix $A$ with respect to a query, and returns only the
relevant document vectors (or the corresponding documents, respectively). The query is
represented by a vector $q$ the same way as a document is represented. The goal is
to return the documents most similar (relevant) to the query. For this purpose,
a similarity function must be defined, assigning a similarity value to each pair
of query and document vectors $(q, d_j)$. In the context of TR, the cosine measure

$$SIM_{cos}(q, d_j) = \frac{\sum_{i=1}^{n} q_i \, w_{ij}}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} w_{ij}^2}}$$

is widely used. During query processing, the columns of $A$ (the document vectors)
are compared against the query vector using the cosine measure, and the sufficiently
similar documents are returned as a result. According to the query extent, we
distinguish range queries and $k$-nearest neighbors ($k$-NN) queries. A range query
returns documents whose similarity to the query exceeds a given similarity threshold.
A $k$-NN query returns the $k$ most similar documents.
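
As a minimal illustration of the two query types, the following sketch evaluates the cosine measure by scanning a toy collection sequentially. The data and the function names are hypothetical, and the documents are stored as rows (one weight vector per document) purely for convenience.

```python
import math

def cosine(q, d):
    """Cosine measure between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm > 0 else 0.0

def range_query(q, docs, threshold):
    """Indices of documents whose similarity to q exceeds `threshold`."""
    return [j for j, d in enumerate(docs) if cosine(q, d) > threshold]

def knn_query(q, docs, k):
    """Indices of the k documents most similar to q, most similar first."""
    ranked = sorted(range(len(docs)), key=lambda j: cosine(q, docs[j]), reverse=True)
    return ranked[:k]

# Toy collection with n = 3 terms and m = 4 documents (hypothetical weights).
docs = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 2.0],
]
q = [1.0, 1.0, 0.0]

in_range = range_query(q, docs, 0.6)
nearest = knn_query(q, docs, 2)
print(in_range, nearest)
```

Both functions touch every document vector, which is exactly the cost that the indexing methods discussed later try to avoid.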

Generally, there are two ways to specify a query. First, a few-term query
is specified by the user using a few terms, so the corresponding vector for such
a query is very sparse. Second, a many-term query is specified using a text
document, thus the corresponding query vector is usually denser. In this paper
we focus on many-term queries, since they better satisfy the similarity
search paradigm which the vector model should follow.

1.2 LSI Vector Model (Simplified) 

Simply said, the LSI (latent semantic indexing) model [11, 4] is an algebraic
extension of the classic vector model. First, the term-by-document matrix $A$ is
decomposed by singular value decomposition (SVD) as $A = U \Sigma V^T$. The matrix
$U$ contains concept vectors, where each concept vector is a linear combination
of the original terms. The concepts are meta-terms (groups of terms) appearing
in the original documents. While the term-by-document matrix $A$ stores document
vectors, the concept-by-document matrix $\Sigma V^T$ stores pseudo-document
vectors. Each coordinate of a pseudo-document vector represents the weight of the
corresponding concept in a document.



Latent Semantics. The concept vectors are ordered with respect to their
significance (the corresponding singular values in $\Sigma$). Consequently, only a
small number of concepts is really significant (these concepts represent,
statistically, the main themes present in the collection); let us denote this number
as $k$. The remaining concepts are unimportant (noisy concepts) and can be omitted,
thus the dimensionality is reduced from $n$ to $k$. Finally, we obtain an approximation
(rank-$k$ SVD) $A \approx U_k \Sigma_k V_k^T$, where for sufficiently high $k$ the
approximation error will be negligible. Moreover, for a low $k$ the effectiveness can
be subjectively even higher (according to the precision/recall values) than for a
higher $k$ [3]. When searching in a real-world collection, the optimal $k$ usually
ranges from several tens to several hundreds. Unlike the term-by-document matrix $A$,
the concept-by-document matrix $\Sigma_k V_k^T$ as well as the concept base matrix
$U$ are dense.



Queries. Searching for documents in the LSI model is performed the same way
as in the classic vector model; the difference is that the matrix $\Sigma_k V_k^T$ is
searched instead of $A$. Moreover, the query vector $q$ must be projected into the
concept base, i.e. $U_k^T q$ is the pseudo-query vector used by LSI. Since the concept
vectors of $U$ are dense, a pseudo-query vector is dense as well.
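
The effect of the projection can be seen in a small sketch. The concept base below is a hypothetical matrix with two orthonormal columns, chosen by hand rather than computed by an SVD, and `project_query` is an illustrative helper name.

```python
def project_query(U_k, q):
    """Return the pseudo-query U_k^T q; U_k is an n x k matrix given as n rows."""
    k = len(U_k[0])
    return [sum(U_k[i][c] * q[i] for i in range(len(q))) for c in range(k)]

# Hypothetical concept base for n = 4 terms and k = 2 concepts.
# The two columns are orthonormal, as the columns of U_k would be.
U_k = [
    [0.5,  0.5],
    [0.5, -0.5],
    [0.5,  0.5],
    [0.5, -0.5],
]

q = [1.0, 0.0, 0.0, 0.0]      # sparse few-term query: only the first term is used
pseudo_q = project_query(U_k, q)
print(pseudo_q)
```

Although the query contains a single term, every coordinate of the pseudo-query is nonzero, which is why methods that rely on query sparsity do not help in the LSI model.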

1.3 Vector Query Processing 

In this paper we focus on the efficiency of vector query processing. More specifically,
we say that a query is processed efficiently if only a small proportion of the
matrix storage volume needs to be loaded and processed. In this section
we outline several existing approaches to vector query processing.



Document Vector Scanning. The simplest way to process a query
is the sequential scanning of all the document vectors (i.e. the columns of $A$ or
$\Sigma_k V_k^T$, respectively). Each document vector is compared against the query
vector using the similarity function, and sufficiently similar documents are returned
to the user. It is obvious that for any query the whole matrix must be processed.
However, sequential processing of the whole matrix is sometimes more efficient
(from the disk management point of view) than the random access to a smaller
part of the matrix used by some other methods.



Term Vector Filtering. For sparse query vectors (i.e. few-term queries),
there exists a more efficient scanning method. Instead of the document
vectors, the term vectors (i.e. the rows of the matrix) are processed. The cosine
measure is computed simultaneously for all the document vectors, which are
“orthogonally” involved in the term vectors. Due to the simultaneous evaluation of the
cosine measure, a set of $m$ accumulators (storing the evolving similarities between
each document and the query) must be maintained in memory. The advantage
of term filtering is that only those term vectors must be scanned for which the
corresponding term weights in the query vector are nonzero. Term vector
filtering can easily be implemented using an inverted file, as a part of the boolean
model implementation [15].
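
The accumulator scheme can be sketched as follows. This is an illustrative, in-memory version with hypothetical data and helper names; the inverted file is a plain dictionary and document norms are assumed to be precomputed.

```python
import math
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to its term vector, stored sparsely as (doc_id, weight) pairs."""
    inverted = defaultdict(list)
    for doc_id, weights in enumerate(docs):
        for term, w in weights.items():
            inverted[term].append((doc_id, w))
    return inverted

def term_filtering_scores(query, inverted, doc_norms):
    """Cosine scores computed term by term with one accumulator per touched document."""
    accumulators = defaultdict(float)
    for term, q_w in query.items():              # only terms with nonzero query weight
        for doc_id, w in inverted.get(term, []):
            accumulators[doc_id] += q_w * w
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    return {d: s / (q_norm * doc_norms[d]) for d, s in accumulators.items()}

# Hypothetical collection: each document is a sparse {term: weight} mapping.
docs = [
    {"metric": 1.0, "tree": 1.0},
    {"vector": 2.0},
    {"metric": 1.0, "vector": 1.0},
]
doc_norms = [math.sqrt(sum(w * w for w in d.values())) for d in docs]
inverted = build_inverted_file(docs)

scores = term_filtering_scores({"metric": 1.0}, inverted, doc_norms)
print(scores)
```

For the one-term query only a single term vector is read and document 1 never receives an accumulator; a dense query would force every term vector to be scanned.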

The simple method of term filtering has been improved by an approximate
approach [19] reducing the time as well as space costs. Generally, the improvement
is based on early termination of query processing, exploiting a restructured
inverted file where the entries of each term are sorted by the decreasing number of
occurrences of the term in a document. Thus, the most relevant documents in each term
entry are processed first. As soon as the first document is found in which the
number of term occurrences is less than a given addition threshold, the processing
of the term entry can stop, because all the remaining documents have the same
or lower importance than the first rejected document. Since some of the documents
are never reached during query processing, the number of accumulators used
can be smaller than $m$, which also reduces the space costs. Another improvement
of the inverted file, exploiting quantized weights, was proposed recently [2],
reducing the search costs even further.
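
The early-termination idea can be illustrated with a short sketch. It only shows the stopping rule for a single term entry; the data are hypothetical and the details of the method in [19] (for example, how the threshold is chosen) are not reproduced here.

```python
def scan_term_entry(entry, addition_threshold):
    """Process a term entry sorted by decreasing term frequency.

    Scanning stops at the first document whose term frequency is below the
    threshold, because every later document has the same or a lower frequency.
    """
    processed = []
    for doc_id, tf in entry:
        if tf < addition_threshold:
            break                      # early termination
        processed.append(doc_id)
    return processed

# Hypothetical term entry: (document id, occurrences of the term), sorted by occurrences.
entry = [(7, 12), (3, 9), (11, 4), (5, 2), (8, 1)]
kept = scan_term_entry(entry, addition_threshold=4)
print(kept)
```

Documents 5 and 8 are never reached, so no accumulators need to be allocated for them on account of this term.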

Despite the above-mentioned improvements, term vector filtering is generally
not very efficient for many-term queries, because fewer term vectors can be
filtered out. Moreover, term vector filtering is completely useless
for the LSI model, since each pseudo-query vector is dense, and none of the
term vectors can be skipped.



Signature Methods. Signature files are a popular filtering method in the
boolean model [13]; however, only a few attempts have been made to use them in
the vector model. In that case, the usage of signature files is not so straightforward
due to the term weights. Weight-partitioned signature files (WPSF) [14]
try to solve the problem by recording the term weights in so-called TF-groups.
A sequential file organization was chosen for the WPSF, which caused excessive
searching of the signature file. An improvement was proposed recently [16] using
S-trees [12] to speed up the signature file search. Another signature-like approach
is the VA-file [6]. In general, usage of the signature methods is still complicated
for the vector model, and the results achieved so far are rather poor.

2 Metric Indexing 

Since in the vector model the documents are represented as points within an
$n$-dimensional vector space, in our approach we create an index for the
term-by-document matrix (for the concept-by-document matrix in the case of LSI) based
on metric access methods (MAMs) [8]. A property common to all MAMs is that
they exploit only a metric function for the indexing. The metric function stands
for a similarity function, thus metric access methods provide a natural way to perform
similarity search. Among the many MAMs, we have chosen the M-tree.

2.1 M-Tree 

The M-tree [9, 18, 21] is a dynamic data structure designed to index objects of
metric datasets. Let us have a metric space $\mathcal{M} = (U, d)$ where $U$ is an object
universe (usually a vector space), and $d$ is a function measuring the distance between
two objects in $U$. The function $d$ must be a metric, i.e. it must satisfy the axioms
of reflexivity, positivity, symmetry and triangular inequality. Let $S \subseteq U$ be a
dataset to be indexed. In the case of the vector model in TR, an object $O_i \in S$ is
represented by a (pseudo-)document vector of a document $D_i$. The particular
metric $d$, replacing the cosine measure, will be introduced in Section 2.2.
Like the other indexing trees based on the B+-tree, the M-tree structure is a
balanced hierarchy of nodes. In the M-tree the objects are distributed in a hierarchy
of metric regions (each node represents a single metric region) which can be,
in turn, interpreted as a hierarchy of object clusters. The nodes have a fixed
capacity and a minimum utilization threshold. The leaf nodes contain ground
entries $grnd(O_i)$ of the indexed objects themselves, while in the inner nodes the
routing entries $rout(O_j)$ are stored, representing the metric regions and routing
to their covering subtrees. Each routing entry determines a metric region in space
$\mathcal{M}$, where the object $O_j$ is the center of that region and $r_{O_j}$ is a radius
bounding the region. For the hierarchy of metric regions (routing entries $rout(O_j)$,
respectively) in the M-tree, the following requirement must be satisfied:

All the objects of ground entries stored in the leaves of the covering subtree
of $rout(O_j)$ must be spatially located inside the region defined by $rout(O_j)$.

The most important consequence of the above requirement is that many
regions on the same M-tree level may overlap. An example in Figure 1 shows
several objects partitioned among metric regions and the appropriate M-tree.
We can see that the regions defined by $rout_1(O_1)$, $rout_1(O_2)$, $rout_1(O_4)$ overlap.
Moreover, object $O_5$ is located inside the regions of $rout_1(O_1)$ and $rout_1(O_4)$ but
it is stored just in the subtree of $rout_1(O_4)$. Similarly, the object $O_3$ is located
even in three regions but it is stored just in the subtree of $rout_1(O_2)$.




Fig. 1. Hierarchy of metric regions (a) and the appropriate M-tree (b) 

Similarity Queries in the M-Tree. The structure of the M-tree natively supports
similarity queries. The similarity function is represented by the metric function
$d$, where close objects are interpreted as similar.

A range query RangeQuery($Q$, $r_Q$) is specified as a query region given by a
query object $Q$ and a query radius $r_Q$. The purpose of a range query is to retrieve
all objects $O_i$ satisfying $d(Q, O_i) \le r_Q$. A $k$-nearest neighbours query
($k$-NN query) kNNQuery($Q$, $k$) is specified by a query object $Q$ and a number $k$. A
$k$-NN query retrieves the first $k$ nearest objects to $Q$.

During range query processing ($k$-NN query processing, respectively), the
M-tree hierarchy is traversed downwards. Only if a routing entry $rout(O_j)$ (its
metric region, respectively) overlaps the query region is the covering subtree of
$rout(O_j)$ relevant to the query and thus further processed.
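
The overlap test that drives the traversal can be sketched as follows. This is a deliberately simplified model, not the M-tree of [9]: the tree is built by hand, Euclidean distance stands in for the metric $d$, the class and function names are hypothetical, and the parent-distance optimizations of the real structure are omitted.

```python
import math

def dist(a, b):
    """Euclidean distance, standing in for an arbitrary metric d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Routing:
    """Routing entry: region center, covering radius, and covering subtree."""
    def __init__(self, center, radius, child):
        self.center, self.radius, self.child = center, radius, child

def range_query(node, q, r_q, visited_leaves):
    """Collect objects O with d(q, O) <= r_q; log every leaf that had to be read."""
    result = []
    if node and isinstance(node[0], Routing):            # inner node
        for entry in node:
            # The query region and the metric region overlap only if
            # d(q, center) <= r_q + covering radius.
            if dist(q, entry.center) <= r_q + entry.radius:
                result += range_query(entry.child, q, r_q, visited_leaves)
    else:                                                # leaf node
        visited_leaves.append(node)
        result += [o for o in node if dist(q, o) <= r_q]
    return result

# Hand-built two-level tree with two metric regions (hypothetical data).
leaf_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
leaf_b = [(10.0, 10.0), (11.0, 10.0)]
root = [Routing((0.0, 0.0), 1.0, leaf_a), Routing((10.0, 10.0), 1.0, leaf_b)]

visited = []
found = range_query(root, (0.5, 0.0), 0.6, visited)
print(found, len(visited))
```

The second region is pruned without reading its leaf; the pruning is safe only because the triangular inequality guarantees that no object inside a non-overlapping region can lie within the query radius.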

2.2 Application of M-Tree in the Vector Model 

In the vector model the objects $O_i$ are represented by (pseudo-)document vectors
$d_i$, i.e. by columns of the term-by-document or concept-by-document matrix,
respectively. We cannot use the cosine measure $SIM_{cos}(d_i, d_j)$ as a metric function
directly, since it does not satisfy the metric axioms. As an appropriate
metric, we define the deviation metric $d_{dev}(d_i, d_j)$ as a vector deviation

$$d_{dev}(d_i, d_j) = \arccos(SIM_{cos}(d_i, d_j))$$

The similarity queries supported by the M-tree (utilizing $d_{dev}$) are exactly those
required for the vector model (utilizing $SIM_{cos}$). Specifically, the range query
will return all the documents that are similar to a query more than some given
threshold (transformed to the query radius) while the $k$-NN query will return
the first $k$ most similar (closest, respectively) documents to the query.

In the M-tree hierarchy similar documents are clustered among metric regions.
Since the triangular inequality for $d_{dev}$ is satisfied, many irrelevant document
clusters can be safely pruned during query processing, thus the search
efficiency is improved.
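
A small sketch may help to see why the angle is used instead of the cosine value itself. The helper names are illustrative, and the comparison with the naive dissimilarity $1 - SIM_{cos}$ is an added example, not something taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine measure between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def d_dev(a, b):
    """Deviation metric: the angle (in radians) between two vectors."""
    c = max(-1.0, min(1.0, cosine(a, b)))    # clamp against rounding error
    return math.acos(c)

def radius_from_threshold(sim_threshold):
    """Query radius that corresponds to a cosine-similarity threshold."""
    return math.acos(sim_threshold)

a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

# The angles satisfy the triangular inequality (here with equality) ...
angle_ok = d_dev(a, c) <= d_dev(a, b) + d_dev(b, c) + 1e-12
# ... whereas the naive dissimilarity 1 - cosine does not.
naive_ok = (1 - cosine(a, c)) <= (1 - cosine(a, b)) + (1 - cosine(b, c))

print(angle_ok, naive_ok, radius_from_threshold(0.5))
```

A similarity threshold of 0.5 thus becomes a query radius of $\pi/3$, and a document satisfies the threshold exactly when its deviation from the query is at most that radius.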

3 Semi-metric Search 

In this section we propose the concept of semi-metric search - an approximate 
extension of metric search applied to M-tree. The semi-metric search provides 
even more efficient retrieval, considerably resistant to the curse of dimensionality. 

3.1 Curse of Dimensionality 

The metric indexing itself (as is experimentally verified in Section 4) is beneficial
for searching in the LSI model. However, searching in a collection of
high-dimensional document vectors of the classic vector model is negatively affected
by a phenomenon called the curse of dimensionality [7, 8]. In the M-tree hierarchy
(even the most optimal one) the curse of dimensionality causes clusters of
high-dimensional vectors to be indistinct, which is reflected by huge
overlaps among metric regions.

Intrinsic Dimensionality. In the context of metric indexing, the curse of
dimensionality can be generalized to general metric spaces. The major condition
determining the success of metric access methods is the intrinsic dimensionality
of the indexed dataset. The intrinsic dimensionality of a metric dataset (one of
the interpretations [8]) is defined as

$$\rho = \frac{\mu^2}{2\sigma^2}$$

where $\mu$ and $\sigma^2$ are the mean and the variance of the dataset’s distance
distribution histogram. In other words, if all pairs of the indexed objects are almost
equally distant, then the intrinsic dimensionality is maximal (i.e. the mean is
high and/or the variance is low), which means the dataset is poorly intrinsically
structured. So far, for datasets of high intrinsic dimensionality there is still
no efficient MAM for exact metric search. In the case of the M-tree, a high
intrinsic dimensionality causes almost all the metric regions to overlap each
other, and searching in such an M-tree deteriorates to sequential search.

In the case of vector datasets, the intrinsic dimensionality negatively depends on
the correlations among coordinates of the dataset vectors. The intrinsic
dimensionality can reach up to the value of the classic (embedding) dimensionality. For
example, for uniformly distributed (i.e. not correlated) $n$-dimensional vectors the
intrinsic dimensionality tends to be maximal, i.e. $\rho \approx n$.
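
The following sketch estimates the intrinsic dimensionality from the pairwise distances of a sample, assuming the definition $\rho = \mu^2 / (2\sigma^2)$ from [8]. Euclidean distance and uniformly random vectors are stand-ins chosen for illustration; the absolute values depend on the metric and the data, so only the trend should be read from it.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def intrinsic_dimensionality(objects, dist):
    """rho = mu^2 / (2 * sigma^2), computed over all pairwise distances."""
    distances = [dist(a, b) for a, b in combinations(objects, 2)]
    return mean(distances) ** 2 / (2 * pvariance(distances))

def euclid(a, b):
    """Euclidean distance, used here only as an example metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)

def uniform_vectors(n_dim, count=200):
    """`count` random vectors with uncorrelated coordinates in [0, 1)."""
    return [[rng.random() for _ in range(n_dim)] for _ in range(count)]

rho_low = intrinsic_dimensionality(uniform_vectors(2), euclid)
rho_high = intrinsic_dimensionality(uniform_vectors(20), euclid)
print(rho_low, rho_high)
```

For uncorrelated coordinates the estimate grows with the embedding dimension: the mean distance increases while the variance stays small, so the distance histogram becomes narrow relative to its mean.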

In the following section we propose a concept of semi-metric modifications
that decrease the intrinsic dimensionality and, as a consequence, provide a way
to perform efficient approximate similarity search.

3.2 Modification of the Metric 

Increasing the variance of the distance distribution histogram is a straightforward
way to decrease the intrinsic dimensionality. This can be achieved by a
suitable modification of the original metric, preserving the similarity ordering
among objects in the query result.

Definition 1. Let us call the increasing modification $d^f_{dev}$ of a metric $d_{dev}$ a
function

$$d^f_{dev}(O_i, O_j) = f(d_{dev}(O_i, O_j))$$

where $f : \langle 0, \pi \rangle \to \mathbb{R}_0^+$ is an increasing function and $f(0) = 0$.
For simplicity, let $f(\pi) = 1$.

Definition 2. Let $s : U \times U \to \mathbb{R}_0^+$ be a similarity function (or a distance
function) and $SimOrder_s : U \to \mathcal{P}(S \times S)$ be a function defined as

$$\langle O_i, O_j \rangle \in SimOrder_s(Q) \Leftrightarrow s(O_i, Q) < s(O_j, Q)$$

$\forall O_i, O_j \in S$, $\forall Q \in U$. In other words, the function $SimOrder_s$ orders
the objects of dataset $S$ according to their distances to the query object $Q$.

Proposition. For the metric $d_{dev}$ and every increasing modification $d^f_{dev}$ the
following equality holds:

$$SimOrder_{d_{dev}}(Q) = SimOrder_{d^f_{dev}}(Q), \quad \forall Q \in U$$



Proof:

“$\subset$”: The function $f$ is increasing. If $d_{dev}(O_i, O_j) > d_{dev}(O_k, O_l)$ holds
for any $O_i, O_j, O_k, O_l \in U$, then $f(d_{dev}(O_i, O_j)) > f(d_{dev}(O_k, O_l))$ must
also hold.

“$\supset$”: The second part of the proof is similar. □
As a consequence of the proposition, if we process a query sequentially over
the entire dataset $S$, then it does not matter whether we use $d_{dev}$ or $d^f_{dev}$,
since both will return the same query result.

If the function $f$ is additionally subadditive, i.e. $f(a) + f(b) \ge f(a + b)$, then
$f$ is metric-preserving [10], i.e. $f(d(O_i, O_j))$ is still a metric. More specifically,
concave functions are metric-preserving (see Figure 2a), while convex (even
partially convex) functions are not; let us call them metric-violating functions (see
Figure 2b). A metric modified by a metric-violating function $f$ is a semi-metric,
i.e. a function satisfying all the metric axioms except the triangular inequality.
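
The difference between the two kinds of modifying functions can be checked numerically. The two functions below are hypothetical examples that merely satisfy the conditions of Definition 1 (increasing, $f(0) = 0$, $f(\pi) = 1$); they are not claimed to be the ones used by the authors.

```python
import math

def f_concave(x):
    """Concave (metric-preserving) modifier on <0, pi>: f(0) = 0, f(pi) = 1."""
    return math.sqrt(x / math.pi)

def f_convex(x):
    """Convex (metric-violating) modifier on <0, pi>: f(0) = 0, f(pi) = 1."""
    return (x / math.pi) ** 2

def satisfies_triangle(d_ab, d_bc, d_ac, f=lambda x: x):
    """Check f(d(a, c)) <= f(d(a, b)) + f(d(b, c)) for one triple of objects."""
    return f(d_ac) <= f(d_ab) + f(d_bc) + 1e-12

# Angles between three unit vectors placed at 0, 0.5 and 1.0 radians in a plane.
d_ab, d_bc, d_ac = 0.5, 0.5, 1.0

original_ok = satisfies_triangle(d_ab, d_bc, d_ac)
concave_ok = satisfies_triangle(d_ab, d_bc, d_ac, f_concave)
convex_ok = satisfies_triangle(d_ab, d_bc, d_ac, f_convex)

# Any increasing f leaves the ordering of distances untouched.
distances = [1.0, 0.2, 0.5, 0.9]
same_order = sorted(distances, key=f_convex) == sorted(distances)

print(original_ok, concave_ok, convex_ok, same_order)
```

The convex modifier breaks the triangular inequality for this triple while the similarity ordering is unaffected, which is why a sequential scan stays exact but index-based pruning does not.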



Fig. 2. (a) Metric-preserving functions (b) Metric-violating functions



Clustering Properties. Let us analyze the clustering properties of the
modifications $d^f_{dev}$ (see also Figure 2). For concave $f$, two objects close to each
other according to $d_{dev}$ are more distant according to $d^f_{dev}$. Conversely, for
convex $f$, objects close according to $d_{dev}$ are even closer according to $d^f_{dev}$.
As a consequence, the concave modifications $d^f_{dev}$ have a negative influence on
clustering, since the object clusters become indistinct. On the other hand, the convex
modifications $d^f_{dev}$ tighten the object clusters even more, making the cluster
structure of the dataset more evident. Simply put, the convex modifications increase
the distance histogram variance, thereby decreasing the intrinsic dimensionality.
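
This effect can be observed on synthetic data. The sketch below assumes the definition $\rho = \mu^2 / (2\sigma^2)$ of intrinsic dimensionality and uses random non-negative vectors together with two example modifiers (a square and a square root of the normalized angle); all of these choices are illustrative assumptions.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def d_dev(a, b):
    """Deviation metric: angle between two vectors, in radians."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def rho(distances):
    """Intrinsic dimensionality mu^2 / (2 * sigma^2) of a list of distances."""
    return mean(distances) ** 2 / (2 * pvariance(distances))

rng = random.Random(1)
vectors = [[rng.random() for _ in range(20)] for _ in range(150)]
base = [d_dev(a, b) for a, b in combinations(vectors, 2)]

rho_metric = rho(base)                                      # original d_dev
rho_convex = rho([(x / math.pi) ** 2 for x in base])        # convex modifier
rho_concave = rho([math.sqrt(x / math.pi) for x in base])   # concave modifier

print(rho_concave, rho_metric, rho_convex)
```

The convex modifier spreads the distance histogram relative to its mean and so lowers the estimate, while the concave one compresses it and raises the estimate.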



3.3 Semi-metric Indexing and Search 

The increasing modifications $d^f_{dev}$ can be utilized in the M-tree instead of the
deviation metric $d_{dev}$. In the case of a semi-metric modification $d^f_{dev}$, query
processing is more efficient because of smaller overlaps among metric regions in
the M-tree. Using metric-preserving modifications is not beneficial, since their
clustering properties are worse, and the overlaps among metric regions are larger.
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Semi-metric Search. A semi-metric modification can be used for all op- 
erations on the M-tree, i.e. for M-tree building as well as for M-tree searching. 
With respect to M-tree construction principles (we refer to [21]) and the propo- 
sition in Section 3.2, the M-tree hierarchies built either by d or d^^^ are the 
same. For that reason, an M-tree built using a metric d can be queried using any 
modification Such semi-metric queries must be extended by the function /, 
which stands for an additional parameter. For a range query the query radius tq 
must be modified to firg). During a semi-metric query processing, the function 
/ is applied to each value computed using d as well as it is applied to the metric 
region radii stored in the routing entries. 
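Why a convex f reduces the number of visited regions can be seen from a simplified overlap test. The sketch below is an assumption-laden illustration, not the M-tree algorithm of [21]: it only models the basic check of whether a query ball and a region ball may intersect, the function names are hypothetical, and f(t) = t^2 is chosen purely as an example of a convex modifier.

```python
def can_prune_metric(dist_q_center, r_query, r_region):
    """Metric test: the region cannot intersect the query ball
    if the two centers are farther apart than the sum of the radii."""
    return dist_q_center > r_query + r_region

def can_prune_semimetric(dist_q_center, r_query, r_region, f):
    """Same test with f applied to the computed distance, the query
    radius and the stored region radius (hypothetical helper)."""
    return f(dist_q_center) > f(r_query) + f(r_region)

f = lambda t: t ** 2   # assumed convex modifier, for illustration only

# Under d the balls overlap (0.5 <= 0.3 + 0.3), so the subtree is visited.
print(can_prune_metric(0.5, 0.3, 0.3))          # False
# Under f the test discards the subtree (0.25 > 0.09 + 0.09).
print(can_prune_semimetric(0.5, 0.3, 0.3, f))   # True
```

The second call discards a region that may still contain qualifying objects, which is where the approximation error of the semi-metric search comes from.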



Error of the Semi-metric Search. Since the semi-metric d^f does not satisfy
the triangle inequality, a semi-metric query will return more or less
approximate results. Obviously, the error depends on the convexity of the
modifying function f. As an output error, we define a normed overlap error

$$E_{NO} = 1 - \frac{|result_{Mtree} \cap result_{scan}|}{\max(|result_{Mtree}|, |result_{scan}|)}$$

where result_Mtree is a query result returned by the M-tree (using a semi-metric
query), and result_scan is the result of the same query returned by a sequential search
over the entire dataset. The error E_NO can be interpreted as a relative precision
of the M-tree query result with respect to the result of a full sequential scan.
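The normed overlap error can be computed directly from two result sets. The following sketch assumes that the error is one minus the normed overlap of the two results and that results are collections of document identifiers; the function name and the handling of two empty results are assumptions of this example.

```python
def normed_overlap_error(result_mtree, result_scan):
    """1 - |intersection| / max(|result_mtree|, |result_scan|)."""
    a, b = set(result_mtree), set(result_scan)
    if not a and not b:
        return 0.0   # assumption: two empty results count as identical
    return 1.0 - len(a & b) / max(len(a), len(b))

# Three of the four documents returned by the M-tree are also
# found by the sequential scan.
print(normed_overlap_error([1, 2, 3, 9], [1, 2, 3, 4]))   # 0.25
```

An error of 0 means the semi-metric query returned exactly the result of the sequential scan.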



Semi-metric Search in Text Retrieval. In the context of TR, searching
is naturally approximate, since precision/recall values never reach 100%.
From this point of view, the approximate character of semi-metric search is not
a serious limitation: acceptable results can be achieved by choosing a
modifying function f for which the error E_NO does not exceed some small value,
e.g. 0.1. On the other hand, semi-metric search significantly improves the search
efficiency, as is experimentally verified in the following section.

4 Experimental Results 

For the experiments we have chosen the Los Angeles Times collection (a part
of TREC 5) consisting of 131,780 newspaper articles. The entire collection
contained 240,703 unique terms. As “rich” many-term queries, we used articles
consisting of at least 1000 unique terms. The experiments focused on the disk
access costs (DAC) incurred during k-NN query processing. Each k-NN query was
repeated for 100 different query documents and the results were averaged.
Disk access was aligned to 512 B blocks, considering access to both the M-tree
index and the respective matrix. The overall query DAC are presented
in megabytes. The entries of the M-tree nodes contained just the document
vector identifiers (i.e. pointers to the matrix columns), thus the M-tree storage
volume was minimized. Table 1 presents the M-tree configuration used for the
experiments (for a more detailed description see [21]).

The labels of the form Devxxx in the figures below stand for the modifying functions
f used by semi-metric search. Several functions of the form DevSQp(x) = ... were
chosen. The queries labeled Dev represent the original metric queries presented
in Section 2.2.



Table 1. The M-tree configuration 



Page size: 512 B; Capacity (leaves: 42, nodes: 21) 
Construction: MinMax + SingleWay + SlimDown 
Tree height: 4; Avg. util. (leaves: 56%, nodes: 52%)



4.1 Classic Vector Model 

First, we performed tests for the classic vector model. The storage of the
term-by-document matrix (in CCS format [4]) took 220 MB. The storage of the M-tree
index was about 4 MB (i.e. 1.8% of the matrix storage volume (MSV)).

Figure 3a compares document vector scanning, term vector filtering, and metric
and semi-metric search. With document vector scanning, the whole matrix
(i.e. 220 MB DAC) was loaded and processed. Since the query vectors contained
many zero weights, term vector filtering worked more efficiently (76 MB DAC,
i.e. 34% of MSV).



Fig. 3. Classic vector model, k-NN queries: (a) Disk access costs (b) Normed overlap error E_NO



The metric search Dev did not perform well: the curse of dimensionality
(n = 240,703) forced almost 100% of the matrix to be processed. The extra
30 MB DAC overhead (beyond the 220 MB of MSV) was caused by the
non-sequential access to the matrix columns. The semi-metric search, on the other
hand, performed better. The DevSQ10 queries for k = 5 consumed only 30 MB
DAC (i.e. 13.6% of MSV). Figure 3b shows the normed overlap error E_NO of
the semi-metric search. For DevSQ4 queries the error was negligible. The error
for DevSQ6 remained below 0.1 for k > 35. The DevSQ10 queries were affected
by a relatively high error, from 0.25 to 0.2 (with increasing k).



4.2 LSI Model 

The second set of tests was made for the LSI model. The target (reduced)
dimensionality was chosen to be 200. The storage of the concept-by-document matrix
took 105 MB, while the size of the M-tree index was about 3 MB (i.e. 2.9% of MSV).

Because the term-by-document matrix was very large, the direct calculation
of the SVD was impossible. Therefore, we used a two-step method [17], which in
the first step calculates a random projection [1, 5] of the document vectors
into a smaller dimensionality of pseudo-concepts. This is done by multiplying
a zero-mean unit-variance random matrix by the term-by-document matrix.
In the second step, a rank-2k SVD is calculated on the resulting
pseudoconcept-by-document matrix, giving us a very good approximation of the
classic rank-k SVD.
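The first step of the two-step method can be sketched as follows. This is only an illustration: the function name `random_projection`, the sparse dictionary layout of the document vectors and the choice of Gaussian entries are assumptions (the text only requires zero mean and unit variance), and the subsequent rank-2k SVD, which would normally be delegated to a numerical library, is not shown.

```python
import random

def random_projection(doc_vectors, n_terms, n_pseudo, seed=0):
    """Project sparse document vectors {term_id: weight} into n_pseudo
    pseudo-concepts using a zero-mean, unit-variance random matrix."""
    rng = random.Random(seed)
    # R[t] holds the n_pseudo random coefficients for term t.
    R = [[rng.gauss(0.0, 1.0) for _ in range(n_pseudo)]
         for _ in range(n_terms)]
    projected = []
    for vec in doc_vectors:
        out = [0.0] * n_pseudo
        for term, weight in vec.items():
            row = R[term]
            for j in range(n_pseudo):
                out[j] += weight * row[j]
        projected.append(out)
    return projected

# Toy data: three documents over a vocabulary of five terms.
docs = [{0: 1.0, 3: 2.0}, {1: 0.5}, {}]
reduced = random_projection(docs, n_terms=5, n_pseudo=2)
print(len(reduced), len(reduced[0]))   # 3 2
```

Only the non-zero weights of each document are touched, so the cost of the projection grows with the number of non-zero matrix entries rather than with the full matrix size.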



Fig. 4. LSI model, k-NN queries: (a) Disk access costs (b) Normed overlap error E_NO



Figure 4a shows that the metric search Dev itself was more than twice as
efficient as document vector scanning. Even better results were achieved by
the semi-metric search. The DevSQS queries for k = 5 consumed only 5.8 MB
DAC (i.e. 5.5% of MSV). Figure 4b shows the error E_NO. For DevSQ1.5 queries
the error was negligible; for DevSQ2 it remained below 0.06. The DevSQS queries
were affected by a relatively high error.

5 Conclusion 

In this paper we have proposed a metric indexing method for efficient search
of documents in the vector model. The experiments have shown that metric
indexing itself is suitable for efficient search in the LSI model. Furthermore,
the approximate semi-metric search provides quite efficient similarity
search in the classic vector model, and remarkably efficient search in the LSI
model. The output error of semi-metric search can be effectively tuned by
choosing modifying functions that preserve the expected accuracy sufficiently.

In the future we would like to compare the semi-metric search with
other methods, in particular with the VA-file (in the case of the LSI model). We also
plan to develop an analytical error model for the semi-metric search in the M-tree,
allowing us to predict and control the output error E_NO.

This research has been partially supported by GACR grant No. 201/00/1031.



Abstract. Terms which are not explicitly mentioned in the text of a document
often receive a minor role in current retrieval systems. In this work we connect the
management of such terms with the ability of the retrieval model to handle partial
representations. A simple logical indexing process capable of expressing negated
terms and omitting some other terms in the representation of a document was
designed. Partial representations of documents can be built taking into account
document length and global term distribution. A propositional model of
information retrieval is used to exemplify the advantages of such expressive modeling.
A number of experiments applying these partial representations are reported. The
benefits of the expressive framework became apparent in the evaluation.



1 Introduction 

For many retrieval systems, the set of terms that determines the rank of a certain
document given a query is composed solely of the terms common to the document and
the query. Nevertheless, it is well known that documents are often vague and imprecise,
and many relevant terms are not mentioned. Since topicality is a key component of
retrieval engines, models of Information Retrieval (IR) should avoid taking strong
decisions about the relationship between terms and a document’s semantics.

Current practice in IR tends to unfairly limit the impact of terms which are not
explicitly mentioned by a given document. Although the vector-space model maintains
a dimension for every term of the vocabulary, popular weighting schemes assign a null
weight to those terms not explicitly mentioned. Similarly, probabilistic approaches,
whose basic foundations allow all the terms of the alphabet to be considered in retrieval,
tend to reduce the computation to the set of terms explicitly mentioned by a given
document [14]. A notable exception is found in the Language Modeling
(LM) approaches [9, 2]: a term t which is not present in a document d is not considered
impossible in connection with the document’s semantics; instead, t receives a probability
value greater than zero. This value grows with the global distribution of the term in the
document collection, i.e. if t is frequently used by documents in the collection then it
is possibly related to the document d. This is a valuable approach because it opens a
new way to handle terms not explicitly mentioned in a given document but, on the other
hand, the opposite problem arises: no term can be considered totally unrelated to
a given document. This is because the probability values coming from every query
term are multiplied together and, hence, if zero probabilities were allowed then we would
assign a null probability to any document that regards one or more of the query terms
as unrelated.
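The multiplication problem can be made concrete with a small query-likelihood computation. The sketch below is an illustration only: it assumes linear interpolation between the document model and the collection model as the smoothing scheme (the text does not commit to a particular one), and the function name, the parameter `lam` and the toy data are hypothetical.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(query | doc) as a product of per-term probabilities, each one
    interpolating the document model with the collection model.
    lam = 0 disables smoothing."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    p = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)
        p_coll = coll_tf[t] / len(collection)
        p *= (1 - lam) * p_doc + lam * p_coll
    return p

doc = ["stock", "market", "falls"]
collection = doc + ["bank", "market", "rates", "rise"]
query = ["market", "bank"]

# Without smoothing, the single unseen term "bank" zeroes the product.
print(query_likelihood(query, doc, collection, lam=0.0))       # 0.0
# With smoothing, every query term contributes a non-zero factor.
print(query_likelihood(query, doc, collection, lam=0.5) > 0)   # True
```

In the smoothed case no term can ever be treated as totally unrelated to the document, which is the opposite problem described above.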

In this work we propose an alternative way for handling both situations. A term t 
which is not explicitly mentioned by a document d may be considered as: a) totally 
unrelated to d and, hence, if a query uses t then the document d is penalized (this pe- 
nalization should not be as extreme as assigning a retrieval status value of 0 for d) or b) 
possibly related to d and, hence, a non-zero contribution is computed for modeling the 
possible connection between t and d. 

A formalism allowing partiality can distinguish between: a) lack of information 
about the actual connection between a given topic and a particular document, b) cer- 
tainty that a given topic is completely out of the scope of a given document and c) 
certainty that a given topic is totally connected to the contents of a given document. In 
particular, logic-based models [15, 1] supply expressive representations in which these
situations can be adequately separated. In this work we use a logical model of IR based 
on Propositional Logic and Belief Revision (PLBR) [6, 8] to exemplify the advantages 
of the logical modeling. We design a novel logical indexing method which builds ex- 
pressive document representations. The logical indexing is driven by global term distri- 
bution and document length. In this way, intuitions applied in the context of document 
length normalization [13, 11] and LM smoothing techniques [9] can be incorporated 
into the logical formalism. This indexing approach was empirically evaluated revealing 
the advantages of the approach taken. 

The rest of this paper is organized as follows. In section 2 the foundations of the 
logical model are presented. This section is intentionally brief because further details 
can be found in the literature. Section 3 addresses the construction of partial represen- 
tations for documents in connection with global term distribution and document length. 
Experiments are reported in section 4 and section 5 offers an analysis a posteriori of the 
behaviour of the indexing method. Some conclusions and possible avenues of further 
research are presented in section 6. 

2 The Model 

Given a document and a query represented as propositional formulas d and q,
respectively, it is well known that the notion of logical consequence (i.e. d ⊨ q) is rather strict
for retrieval because it yields a binary relevance decision [15]. The PLBR model defines
a measure of closeness between d and q which can straightforwardly be used to build a
formal rank of documents induced by the query [6, 8].

Dalal’s Belief Revision measure of distance between logical interpretations [3]
lies at the basis of the PLBR approach. A query q can be seen as the set of logical
interpretations satisfying q, i.e. the set of models of q. The distance from each model of
the document to the query is computed as the minimum distance from that model of the
document to the query models. The final distance from the document to the query is the
average distance from the document models to the query.

Given a model of the document and a model of the query, the original PLBR
distance basically counts the number of disagreements, i.e. the number of propositional
letters with different interpretation¹. This approach was later extended to define a new
measure of distance between logical interpretations that takes into account inverse
document frequency (idf) information [4]. Within this measure, every letter mapped into
the same truth value by both interpretations produces an increment to the final
distance that depends on its idf value. Note that this extension maintains the propositional
formalism for representing documents and queries but introduces idf information for
distance computation. As will be explained later, in this paper we use idf
information for producing negated terms in the logical document representations. Observe
that the two uses of idf information are different because the former is done at matching
time whereas the latter is done at indexing time.
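The original counting distance can be illustrated by brute force on a tiny alphabet. The sketch below is a didactic assumption, not the polynomial-time procedure for DNF formulas: documents and queries are restricted to conjunctions of literals written as {letter: truth value}, all models are enumerated explicitly, and the names `models`, `disagreements` and `plbr_distance` are hypothetical.

```python
from itertools import product

def models(literals, alphabet):
    """All interpretations of `alphabet` that satisfy a conjunction of
    literals {letter: truth_value}; unmentioned letters are free."""
    free = [p for p in alphabet if p not in literals]
    for values in product([True, False], repeat=len(free)):
        yield {**literals, **dict(zip(free, values))}

def disagreements(m1, m2):
    """Number of letters interpreted differently by m1 and m2."""
    return sum(m1[p] != m2[p] for p in m1)

def plbr_distance(doc, query, alphabet):
    """Average, over the document models, of the minimum number of
    disagreements with any query model."""
    q_models = list(models(query, alphabet))
    per_model = [min(disagreements(m, q) for q in q_models)
                 for m in models(doc, alphabet)]
    return sum(per_model) / len(per_model)

alphabet = ["a", "b", "c"]
doc = {"a": True, "b": False}     # a AND NOT b; c is omitted
query = {"b": True, "c": True}    # b AND c
print(plbr_distance(doc, query, alphabet))   # 1.5
```

The omitted letter c gives the document two models, at distances 1 and 2 from the query, which is how partial representations enter the final average.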

The PLBR distance can be computed in polynomial time provided that d and q are 
in disjunctive normal form (DNF) [7]. A prototype logical system was implemented to 
evaluate the PLBR model against large collections. The experiments conducted revealed 
important benefits when handling expressions involving both logical conjunctions and 
disjunctions [5]. 

Nevertheless, the logical indexing applied so far was rather simplistic. No major 
attention was paid to the design of evolved techniques to produce more expressive doc- 
ument logical representations. In particular, the use of logical negations was left aside, 
which is precisely the aim of this work. 

3 Partial Representations for Documents 

The PLBR model has provision for establishing a distinction between a term for which
we do not know whether or not it is significant with respect to a given document’s
semantics and a term for which we have positive evidence that it is not related at all
to the document’s contents. The latter case naturally leads to a negated expression of the
term within the document’s representation, whereas the most sensible decision regarding
the former case is to omit the term from the document’s representation.

We first present some heuristics that can be applied to identify appropriate terms
to be negated, and then we take a further step to connect the new logical indexing with
the document’s length.

3.1 Negative Term Selection 

Let us consider a logical representation of a document in which only the terms that
appear in the text of the document are present as positive literals² (let us call this
conservative setting the negate-nothing approach). Of course, many of the terms not mentioned
by the document (unseen terms) will undoubtedly be unconnected with the document’s
contents and, hence, omitting those terms from the document’s representation does not
seem to be the best choice. On the contrary, a negated representation of those terms
appears to be a good alternative. Negating every unseen term (the negate-all
approach) is also unfair because there will be many topics that, although not explicitly
mentioned, are strongly connected with the document’s semantics.

¹ Note that every indexing term is modelled as a propositional letter in the alphabet.

² A literal is a propositional letter or its negation.

We propose a logical indexing strategy that negates some unseen terms selected on
the basis of their distribution in the whole collection. Note that this global information
is also used in the context of LM smoothing strategies for quantifying the relatedness
between unseen terms and a document’s contents. More specifically, a null probability is
not assigned to a term which was not seen in the text of a document. The fact that we
have not seen it does not make it impossible. It is often assumed that a non-occurring
term is possible, but no more likely than what would be expected by chance in the
collection.

If a given term is infrequent in the document base then it is very unlikely that
documents that do not mention it are actually related to this topic (and, thus, very unlikely
that any user who wants to retrieve those documents finds the term useful when
expressing her/his information need). On the other hand, frequent terms are more generic and
have more chance of being connected with the topics of documents even
when they are not explicitly mentioned. This suggests that unseen infrequent terms are
good candidates for negation in the logical indexing process.

The obvious intention when negating a term in a document’s representation is to
move the document away from queries mentioning the term. Consider a query term
which is missing in the text of a given document. If the query term is globally infrequent
and, thus, has been negated within the document’s representation, then the document
will be penalized. On the contrary, if the term is globally frequent and was omitted
from the document’s representation, then the penalization is much lower. This is intuitive
because frequent terms have much more chance of being connected with the contents
of documents that do not explicitly mention them.

3.2 Document Length 

We now pay attention to the issue of the number of terms that should be negated in 
the representation of every document. In this respect, a first question arises: is it fair to 
negate the same number of terms for all documents? In the following we try to give a 
motivated answer. 

Let Td be the subset of terms of the alphabet (T) that are present in the text of 
a document d. Consider that we decide to introduce k negated terms in the logical 
representation of d. That is, every term in Td will form a positive literal and k terms in 
T\Td (the k terms in T\Td most infrequent in the collection) produce k negated literals. 
If we introduce the same number of negations for all documents in the collection we 
would be implicitly assuming that all documents had the same chance of mentioning 
explicitly all their relevant topics. This assumption is not appropriate. 

A long document may simply cover more material than a short one. We can even
think of a long document as a sequence of unrelated short documents concatenated
together. This view is called the scope hypothesis and contrasts with the verbosity
hypothesis, in which a long document is supposed to cover a scope similar to that of a short
document but simply uses more words [11]. It is accepted that the verbosity hypothesis
prevails over the scope hypothesis. Indeed, the control of verbosity stands behind the
success of high-performance document length normalization techniques [13, 11].

This also connects with recent advances in smoothing strategies for Language
Modeling. For instance, a Bayesian predictive smoothing approach takes into account the
difference in data uncertainty between short and long documents [16]. As documents get
larger, the uncertainty in the estimates becomes narrower. A similar idea will drive
our logical indexing process: long documents are supposed to indicate their contents
more exhaustively and, hence, more assumptions about the non-related terms will
be made.

A fixed number of negations for every document is also not advisable from a
practical perspective. The sets T\Td are very large (because |Td| ≪ |T|) and, hence,
there will surely be many common terms in the sets T\Td across all documents. As a
consequence, there will likely be little difference between the negated terms introduced
by two different documents and, therefore, the effect on retrieval performance will be
unnoticeable.

In this work we propose and evaluate a simple strategy in which the number of 
negations grows linearly with the size of the document. In our logical indexing process, 
the size of a document will be measured as the number of different terms mentioned by 
the document. 

Another important issue concerns the maximum and minimum number of negations
that the logical indexing will apply. Let us assume that, for a given document d, we
decide to include 1000 negated literals in its logical representation. Since the number
of negations is relatively low (w.r.t. current term spaces), the terms involved will be
very infrequent, most of them mentioned by a single document in the whole collection
and, therefore, it is also very unlikely that any query finds them useful to express an
information need. As a consequence, a low number of negations will not
produce any noticeable effect on retrieval performance because the negated terms are rare
and will hardly be used by any query. This suggests that significant changes in the retrieval
behaviour of the logical model will be found when the number of negations is high.
Inspired by this, we designed our logical indexing technique starting from a total
closed-world assumption (i.e. we negate every unseen term) and we reduce the number of
negations as the document’s size decreases. That is, instead of starting from a representation
with 0 negated terms which is repeatedly populated by negations involving infrequent
terms, we start from a logical formula with |T \ Td| negated terms and we repeatedly
omit globally frequent terms³.

We now define the number of terms that will be omitted from the logical representation
of a given document as a function of the size of the document:

$$\mathit{OT}_d = \frac{\mathit{max\_dl} - \mathit{dl}_d}{\mathit{max\_dl} - \mathit{min\_dl}} \cdot \mathit{MAX\_OT} \qquad (1)$$

where dl_d is the size of the document d, max_dl (min_dl) is the size of the largest
(shortest) document and MAX_OT is a constant that determines the maximum number
of terms for which the logical indexing will not make any strong decision and, hence,
no literal, either positive or negative, will be expressed⁴.

³ In the future we also plan to articulate an indexing process which skips globally infrequent
terms and, hence, these procedures will be revisited.

⁴ Of course, MAX_OT should be lower than or equal to the smallest value of |T \ Td| computed
across all documents. Otherwise, the indexing process could suggest a value of OTd such that
OTd > |T \ Td|. This indexing could only be implemented by considering some explicit terms
in Td as non-informative words that should be omitted from the representation of the document.
Obviously, this is not the intention pursued here.

To sum up, every document will be represented as a logical formula in which:

- Terms appearing explicitly in the text of the document, t ∈ Td, will be positive
literals in the representation of the document.

- Terms not mentioned explicitly, t ∈ T \ Td, are ranked in decreasing order of
appearances within the whole collection and:

• The top OTd terms will be omitted from the representation of d.

• The remaining terms will be negative literals in the logical formula
representing d.



T = {a, b, c, d, e, f, g, h, i, j, l, m, n, o, p, q, r, s, t, u}

max_dl = 10, min_dl = 2, MAX_OT = 10

Document   Td (explicit terms)   OTd                           Omitted terms
d1         a, r                  (10 - 2)/(10 - 2) · 10 = 10   j, l, m, n, o, p, q, s, t, u
d2         a, c, d, e, u, t      (10 - 6)/(10 - 2) · 10 = 5    o, p, q, r, s

Document   Logical representation
d1         a ∧ r ∧ ¬i ∧ ¬h ∧ ¬g ∧ ¬f ∧ ¬e ∧ ¬d ∧ ¬c ∧ ¬b
d2         a ∧ c ∧ d ∧ e ∧ u ∧ t ∧ ¬n ∧ ¬m ∧ ¬l ∧ ¬j ∧ ¬i ∧ ¬h ∧ ¬g ∧ ¬f ∧ ¬b

Fig. 1. Logical indexing process



Figure 1 illustrates an example of this logical indexing process. The vocabulary of
20 terms is supposed to be ordered in increasing order of appearance within the whole
collection. The largest document is supposed to have 10 terms whereas the shortest one
(d1) mentions just two terms. The constant MAX_OT is assumed to be equal to 10.
Observe that a closed-world assumption indexing would assign 18 and 14 negations to
d1 and d2, respectively, whereas the length-dependent indexing assigns 8 negations to
the short document d1 and 9 negations to the long document d2. Note that the final logical
representation of a long document is more complete because there will be few omitted
terms and, on the contrary, representations of short documents are more partial.
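The indexing steps summarized above can be sketched in a few lines. This is a minimal illustration rather than the authors’ implementation: the function names are hypothetical, rounding the result of equation (1) to the nearest integer is an assumption (the text does not say how fractional values are handled), and the vocabulary is passed as a list sorted from the least to the most frequent term.

```python
def omitted_count(dl, max_dl, min_dl, max_ot):
    """Equation (1); rounding to the nearest integer is an assumption."""
    return round((max_dl - dl) / (max_dl - min_dl) * max_ot)

def index_document(doc_terms, vocab_by_freq, max_dl, min_dl, max_ot):
    """vocab_by_freq: vocabulary sorted from least to most frequent.
    Returns the (positive, omitted, negated) term lists."""
    positive = [t for t in vocab_by_freq if t in doc_terms]
    # Unseen terms, most frequent first.
    unseen = [t for t in reversed(vocab_by_freq) if t not in doc_terms]
    ot = omitted_count(len(doc_terms), max_dl, min_dl, max_ot)
    return positive, unseen[:ot], unseen[ot:]

vocab = list("abcdefghijlmnopqrstu")   # the 20 terms of the example
for name, terms in [("d1", set("ar")), ("d2", set("acdeut"))]:
    pos, omitted, negated = index_document(terms, vocab, 10, 2, 10)
    print(name, len(omitted), len(negated), "".join(negated))
# d1 10 8 ihgfedcb
# d2 5 9 nmljihgfb
```

The counts reproduce the example: 10 omitted terms and 8 negations for the short document, 5 omitted terms and 9 negations for the longer one.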

The tuning constant MAX_OT is an instrument for explicit control of partiality.
If MAX_OT = 0 then the system does not allow partiality in the logical
representations and, therefore, all the vocabulary terms have to be mentioned, either as
positive or as negative literals. As MAX_OT grows, logical representations become more
partial. Obviously, very low values of MAX_OT will not make it possible to establish
significant differences between the indexing of short and long documents.

David E. Losada and Alvaro Barreiro

Table 1. Training phase - Tuning partiality

Topics #151-#200

       cwa        MAX_OT   MAX_OT   MAX_OT   MAX_OT   MAX_OT   MAX_OT   MAX_OT
a      indexing   1000     2000     3000     4000     5000     10000    50000
0.4    0.0719     0.1320   0.1544   0.1475   0.1420   0.1422   0.1136   0.0736
       1533       2013     2090     1849     1912     1845     1639     1539
0.5    0.1055     0.1470   0.1687   0.1562   0.1537   0.1613   0.1526   0.1075
       1760       2010     2048     1786     1837     1950     1810     1764
0.6    0.1520     0.1561   0.1513   0.1289   0.1041   0.1290   0.1426   0.1452
       1751       1864     1738     1447     1298     1578     1522     1748

For each value of a, the first row gives the non-interpolated average precision and the
second row the total number of relevant documents retrieved.



4 Experiments 

This logical indexing was evaluated against the WSJ subset of the TREC collection
in discs 1&2. This collection contains 173252 articles published in the Wall Street
Journal between 1987 and 1992.

We took 50 TREC topics for training the MAX_OT parameter (TREC topics #151
- #200), and a separate set of topics was later used for validating the findings (TREC
topics #101 - #150). For each query, the top 1000 documents were used for evaluation.

Documents and topics were preprocessed with a stoplist of 571 common words
and the remaining terms were stemmed using Porter’s algorithm [10]. Logical queries are
constructed by simply connecting their stems through logical conjunctions. Queries
are long because the subparts Title, Description and Narrative were all considered.
Stemmed document terms are directly incorporated as positive literals and some negated
terms are included in the conjunctive representation of a document depending on the
document’s length and the term’s global frequency. In order to check whether or not this new
logical indexing improves the top performance obtained by the PLBR model so far, we
first ran a number of experiments following a closed-world assumption (i.e. all terms
which are not mentioned by the document are incorporated as negated literals). Recall
that the PLBR model handles idf information when measuring distances between
logical interpretations. This effect is controlled by a parameter a. We tried out values for a
from 0.9 to 0.1 in steps of 0.1. Since the major benefits were found when 0.4 ≤ a ≤ 0.6,
we only present performance results for a = 0.4, 0.5, 0.6. In the second column of
Table 1 (cwa indexing) we show performance ratios (non-interpolated average precision
and total number of relevant retrieved documents) for the cwa indexing approach on the
training set. The best results were found for a value of a equal to 0.6.

Columns 3 to 9 of Table 1 depict performance results for the more evolved logical
indexing with a varied number of omitted terms. Not surprisingly, for high numbers of
omitted terms (≥ 50000) performance tends to the performance obtained with the basic
cwa indexing (second column). This is because the ratio negated_terms / omitted_terms is
so low that almost every query term is either matched by a document or omitted.
There are very few negations and, hence, the distinction between those classes of terms
is unnoticeable. On the other hand, for relatively low values of MAX_OT (between
1000 and 5000) performance tends to improve with respect to cwa indexing. The best
training run is obtained when MAX_OT = 2000, a = 0.5 (0.1687 vs 0.152, an 11%
improvement in non-interpolated average precision, and 2048 vs 1751, 17% more relevant
documents retrieved).

Table 2. Test phase - Effect of partiality

Topics #101-#150

                      cwa indexing    dl indexing
                      (trained)       (not trained)
Recall                a = 0.6         MAX_OT = 2000, a = 0.5
0.00                  0.4576          0.5379
0.10                  0.2842          0.3317
0.20                  0.2154          0.2639
0.30                  0.1788          0.2075
0.40                  0.1445          0.1600
0.50                  0.1195          0.1240
0.60                  0.0923          0.0993
0.70                  0.0717          0.0684
0.80                  0.0397          0.0373
0.90                  0.0188          0.0131
1.00                  0.0098          0.0035
Avg. prec.            0.1319          0.1482
(non-interpolated)
% change                              +12.4%
Total relevant        1828            2301
retrieved
% change                              +25.9%

In order to check the previous findings, we ran additional experiments with the test
set of topics. We fixed a value of 2000 omitted terms and a = 0.5 for the new indexing
approach. Although this is the test phase, we trained the parameter a again for the basic
indexing policy (cwa) and we show here the best results (a = 0.6). This is to ensure
that the new document length indexing without training can improve on the best results
attainable with the basic cwa indexing. The results are shown in Table 2. Major benefits
are found when partial representations are handled. It seems clear that taking
document length into account to omit up to 2000 terms significantly improves the retrieval
performance of the logical model.

This experimentation suggests omitting a relatively low number of terms
with respect to the total vocabulary size. This means that the shortest document will
be able to have 2000 omitted terms within its logical representation. These 2000 terms
will be the most globally used terms that are not present in that small document. It is well
known [12] that the large majority of the words occurring in a corpus have very low
document frequency. This means that a large share of the terms is used just once in the
whole collection and, hence, it is also unlikely that any query makes use of them. That is,
only a small fragment of the vocabulary (the most frequent terms) makes a significant impact
on retrieval performance. Indeed, in the WSJ collection that we indexed, 76839 terms
out of 163656 (which is the vocabulary size after preprocessing) are mentioned in only
a single document. This explains why the major differences in performance are found for
small values of MAX_OT.



5 Analysis 

In this section we provide an additional analysis of the logical indexing keeping track 
of its behaviour against document length. We will follow the methodology designed by 
Singhal, Buckley and Mitra [13] to analyze the likelihood of relevance/retrieval for doc- 
uments of all lengths and plot these likelihoods against the document length to compare 
the relevance pattern and the retrieval pattern. 

First, the document collection is ordered by document length and the documents are
divided into equal-sized chunks, which are called bins. In our case, the 173252 WSJ
documents were divided into 173 bins containing one thousand documents each, and
an additional bin contained the 252 largest documents. For the test topics (#101-#150)
we then took the 4556 (query, relevant document) pairs and counted how many pairs
had their document in the ith bin. These values allow us to plot a relevance pattern
against document length. Specifically, the conditional probability P(D ∈ ith bin | D is
relevant) can be computed as the ratio between the number of pairs whose document
is in the ith bin and the total number of pairs.
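
The binning procedure can be illustrated with a minimal sketch. The function name `length_pattern`, the dictionary `doc_lengths` and the tie-breaking by document identifier are assumptions made for the example; they are not taken from the original experiments.

```python
from collections import Counter

def length_pattern(pairs, doc_lengths, bin_size=1000):
    """Return {bin number: P(D in that bin | D occurs in a pair)}.

    pairs       -- list of (query, document) pairs (relevant or retrieved)
    doc_lengths -- hypothetical mapping from document id to its length
    """
    # Order the collection from shortest to longest document
    # (ties are broken by document id, an assumption of this sketch).
    ordered = sorted(doc_lengths, key=lambda d: (doc_lengths[d], d))
    # Bin 1 holds the shortest documents; the last bin may be smaller.
    bin_of = {doc: rank // bin_size + 1 for rank, doc in enumerate(ordered)}
    counts = Counter(bin_of[doc] for _, doc in pairs)
    total = len(pairs)
    return {b: counts[b] / total for b in sorted(counts)}
```

The same function yields either pattern: it gives the relevance pattern when fed the (query, relevant document) pairs, and a retrieval pattern when fed (query, retrieved document) pairs.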

A given retrieval strategy behaves well with respect to document length
provided that its probability of retrieving documents of a given length is very close
to the probability of finding a relevant document of that length. Therefore, once we have
a relevance pattern, we can compute the retrieval pattern and compare them graphically.
We will compute the retrieval pattern for both the cwa PLBR run and the PLBR run with
document length-dependent indexing. Comparing them with the relevance pattern, we
will be able to validate the adequacy of our document length-dependent indexing and,
possibly, identify further avenues of research.

The computation of the retrieval pattern is also very simple. For each query the top one
thousand retrieved documents are selected (in our case, 50,000 (query, retrieved
document) pairs) and, for each bin, we can directly obtain P(D ∈ ith bin | D is retrieved).

Figure 2 shows the probability of relevance and the probability of retrieval of the
cwa PLBR run plotted against the bin number (fig. 2(a)). The probability of relevance and
the probability of retrieval applying the document length-dependent logical indexing
are plotted in fig. 2(b). Recall that bin #1 contains the smallest documents and bin #174
contains the largest documents. From that figure alone, there is no clear evidence of a
difference between the two approaches. In figure 3 we plot cwa indexing and dl
indexing against document length. Although the curves are very similar, some trends can
be identified. For bins #1 to #100 the dl indexing approach retrieves documents with
higher probability than the cwa approach. On the other hand, very long documents (last
20 bins) are retrieved with higher probability if the cwa strategy is applied. This shows
that the dl indexing procedure does its job, because it tends to favour short
documents w.r.t. long ones. Nevertheless, this analysis also suggests new ways to improve
the document length logical indexing. The most obvious is that very long documents
still present a probability of being retrieved which is much greater than the probability
of relevance (see fig. 2(b), last 20 bins). This suggests that the formula that computes
the number of omitted terms (equation 1, section 3.2) should be adapted accordingly.
As a consequence, subsequent research effort will be directed to the fine tuning of the
document length-dependent indexing.

Fig. 2. Probability of relevance vs probability of retrieval: (a) relevance vs PLBR cwa,
(b) relevance vs PLBR dl indexing



6 Conclusions and Future Work 

In this work we have proposed a novel logical indexing technique which yields a nat- 
ural way to handle terms not explicitly mentioned by documents. The new indexing 
approach is assisted by popular IR notions such as document length normalization and 
global term distribution. The combination of those classical notions and the expressive- 
ness of the logical apparatus leads to a precise modeling of the document’s contents. 
The evaluation conducted confirms empirically the advantages of the approach taken. 

Future work will proceed along several lines. First, as argued in the previous
section, document length contribution should be optimized. Second, more evolved tech- 
niques to negate terms will also be investigated. In this respect, the application of term 
similarity information is especially encouraging for avoiding negated terms whose se- 
mantics is close to some of the terms which appear explicitly in the text of a document. 

Our present document length strategy captures verbosity by means of document
length. Although it is sensible to think that there is a correlation between document
length and verbosity, it is also very interesting to study new methods to identify
verbose/scope documents and tune the model accordingly.

Fig. 3. cwa indexing vs dl indexing
Abstract. In [2] Navarro and Baeza-Yates found their so-called hybrid
index to be the best alternative for indexed approximate search
in English text. The original hybrid index is based on Levenshtein edit
distance. We propose two modifications to the hybrid index. The first
is a way to accelerate the search. The second modification is to make
the index also permit the error of transposing two adjacent characters
("Damerau distance"). A full discussion is presented in Section 11 of [1].



Let $ed(A,B)$ denote the edit distance between strings $A$ and $B$, $|A|$ denote the
length of $A$, $A_i$ denote the $i$th character of $A$, and $A_{i..j}$ denote the substring of $A$
that begins at its $i$th and ends at its $j$th character. Given a length-$m$ pattern
string $P$, a length-$n$ text string $T$, and an error limit $k$, the task of approximate
string matching is to find such text positions $j$ where $ed(P, T_{h..j}) \le k$ and $h \le j$.
Levenshtein edit distance $ed_L(A,B)$ is the minimum number of single-character
insertions, deletions and substitutions needed in transforming $A$ into $B$ or vice
versa. Damerau edit distance $ed_D(A,B)$ is otherwise similar but permits also
the operation of transposing two permanently adjacent characters.
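
As an illustration of the two distances, the following is a minimal dynamic programming sketch. The function name `edit_distance` and its `transpositions` flag are assumptions of the example. The transposition rule implemented here is the restricted variant, in which a transposed pair is not edited further, which is how we read "permanently adjacent".

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True, the restricted
    Damerau distance (adjacent transposition counts as one operation)."""
    m, n = len(a), len(b)
    # d[r][c] = distance between a[:r] and b[:c]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(m + 1):
        d[r][0] = r                      # delete all r characters
    for c in range(n + 1):
        d[0][c] = c                      # insert all c characters
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            cost = 0 if a[r - 1] == b[c - 1] else 1
            best = min(d[r - 1][c] + 1,          # deletion
                       d[r][c - 1] + 1,          # insertion
                       d[r - 1][c - 1] + cost)   # match or substitution
            if (transpositions and r > 1 and c > 1
                    and a[r - 1] == b[c - 2] and a[r - 2] == b[c - 1]):
                best = min(best, d[r - 2][c - 2] + 1)   # transposition
            d[r][c] = best
    return d[m][n]
```

For instance, "ab" and "ba" are at Levenshtein distance 2 but at Damerau distance 1.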

Using an index structure during the search can accelerate approximate string
matching. One such index is the hybrid index of Navarro & Baeza-Yates [2] for
Levenshtein edit distance, which they found to be the best choice for searching
English text. It uses intermediate partitioning, where the pattern is partitioned
into $j$ pieces $P^1, \ldots, P^j$, and then each piece $P^i$ is searched for with
$d^i = \lfloor k/j \rfloor$ errors. If $j > 1$ and a hit $T_{j-h..j}$ is found so that
$ed_L(P^i, T_{j-h..j}) \le d^i$, the text area $T_{j-m-k..j+m+k}$ will be included in a check
for a complete match of $P$ with $k$ errors. The hits for each piece $P^i$ are found by a
depth-first search (DFS) over a suffix tree¹ built for the text. This involves filling a
dynamic programming table $D$, where $D[r,l] = ed_L(P^i_{1..r}, T_{j+1..j+l})$, during the DFS.
When the DFS arrives at a node that corresponds to the text substring $T_{j+1..j+l}$, the
distances $ed_L(P^i_{1..r}, T_{j+1..j+l})$ are computed for $r = 1 \ldots m^i$, where $m^i = |P^i|$.

Our main proposal for accelerating the DFS is as follows. When the DFS
reaches a depth-$l$ node that corresponds to the text substring $T_{j+1..j+l}$ and
where $D[r,l] \ge d^i$ for $r = 1 \ldots m^i$, the only strings that have $T_{j+1..j+l}$ as a prefix
and match $P^i$ with $d^i$ errors are of form $T_{j+1..j+l} \circ P^i_{h+1..m^i}$, where $\circ$ denotes
concatenation and $h$ fulfills the condition $D[h,l] = d^i$. In this situation we check
directly for the presence of any of these concatenated substrings, and then let
the DFS backtrack. Fig. 1a illustrates, and Fig. 1b shows experimental results
from a comparison against the original DFS of [2] when $P^i = P$ and $d^i = k$.

Fig. 1. Figure a): Matrix $D$ for computing $ed_L(P^i, T_{j+1..})$, where $P^i$ = "thesis",
$T_{j+1..}$ = "there..", and $d^i = 2$. Now $D[r,5] \ge d^i = 2$ for $r = 1 \ldots m^i$, and the only
way to reach a cell value $D[m^i,x] \le 2$, where $x > 5$, is to have only matches at the
remaining parts of the top-left-to-bottom-right diagonals with the value $D[h,5] = 2$.
The cells in these diagonal extensions have the value $d^i = 2$ underlined, and the pattern
suffixes corresponding to the cell values $D[3,5]$, $D[4,5]$ and $D[5,5]$ (shown in bold) are
$P^i_{4..6}$ = "sis", $P^i_{5..6}$ = "is" and $P^i_6$ = "s", respectively. Figure b): The ratio between
the running time of our improved DFS (OURS) and the running time of the original DFS
of Navarro and Baeza-Yates (NBY). We tested with two ≈ 10 MB texts: Wall Street
Journal articles (WSJ) and the DNA of baker's yeast (yeast). The computer was a 600
MHz Pentium 3 with 256 MB RAM, Linux OS and GCC 3.2.1 compiler.

* Supported by Tampere Graduate School in Information Science and Engineering.

¹ A trie of all suffixes of the text in which each suffix has its own leaf node and the
position of each suffix is recorded into the corresponding leaf.
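
The pruning condition can be checked on the "thesis" / "there" example of Fig. 1a with a small sketch. The helper names `dp_column` and `pruned_suffixes` are hypothetical, and the code only computes one column of $D$ and the candidate pattern suffixes; it does not traverse a suffix tree.

```python
def dp_column(piece, text_prefix):
    """Column of the table D: entry r is ed_L(piece[:r], text_prefix)."""
    col = list(range(len(piece) + 1))        # column for the empty text prefix
    for ch in text_prefix:
        prev_diag, col[0] = col[0], col[0] + 1
        for r in range(1, len(piece) + 1):
            cur = min(col[r] + 1,                         # skip text character
                      col[r - 1] + 1,                     # skip pattern character
                      prev_diag + (piece[r - 1] != ch))   # match or substitution
            prev_diag, col[r] = col[r], cur
    return col

def pruned_suffixes(piece, text_prefix, d):
    """If D[r,l] >= d for all r = 1..m, return the pattern suffixes that may
    still complete a match with d errors; otherwise return None."""
    col = dp_column(piece, text_prefix)
    if min(col[1:]) < d:
        return None                          # condition does not hold, keep searching
    # piece[h:] is the suffix P_{h+1..m} in the 1-based notation of the text.
    return [piece[h:] for h in range(1, len(piece) + 1) if col[h] == d]
```

With `piece = "thesis"`, `text_prefix = "there"` and `d = 2`, the candidates are the three suffixes named in the caption of Fig. 1.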

In addition we propose the following lemma for partitioning $P$ under Damerau
distance. It uses classes of characters, which refers to permitting a pattern
position to match with any character enumerated inside square brackets. For
example $P$ = "thes[ei]s" matches with the strings "theses" and "thesis".

Lemma 1. Let $P^i$, $i = 1..j$, be $j$ non-overlapping substrings of the pattern $P$
that are ordered so that $P^{i+1}$ occurs on the right side of $P^i$ in $P$. Also let $B$
be some string for which $ed_D(P,B) \le k$, let each $P^i$ be associated with the
corresponding number of errors $d^i$, and let strings $\bar{P}^i$, $i = 1..j$, be defined as
follows:

$\bar{P}^i = P^i$, if $i = j$ or $P^i$ and $P^{i+1}$ do not occur consecutively in $P$.

$\bar{P}^i = \ldots \circ \ldots$ otherwise.

If $\sum_{i=1}^{j} d^i \ge k - j + 1$, then one of the strings $\bar{P}^i$ matches inside $B$ with at most
$d^i$ errors.
Main Results. The basic string matching problem is to determine the occurrences
of a short pattern $P = p_1 p_2 \ldots p_m$ in a large text $T = t_1 t_2 \ldots t_n$, over an alphabet
of size $\sigma$. Indexes are structures built on the text to speed up searches, but they
used to take up much space. In recent years, succinct text indexes have appeared.
A prominent example is the FM-index [2], which takes little space (close to that
of the compressed text) and replaces the text, as it can search the text (in optimal
$O(m)$ time) and reproduce any text substring without accessing it. The main
problem of the FM-index is that its space usage depends exponentially on $\sigma$,
that is, $5 H_k n + \sigma^{\sigma} o(n)$ for any $k$, $H_k$ being the $k$-th order entropy of $T$.

In this paper we present a simple variant of the FM-index, which removes its
alphabet dependence. We achieve this by, essentially (but not exactly),
Huffman-compressing the text and FM-indexing the binary sequence. Our index needs
$2n(H_0 + 1)(1 + o(1))$ bits, independent of $\sigma$, and it searches in $O(m(H_0 + 1))$
average time, which can be made $O(m \log \sigma)$ in the worst case. Moreover, our
index is considerably simpler to implement than most other succinct indexes.

Technical Details. The Burrows-Wheeler transform (BWT) [1] of $T$ is a
permutation $T^{bwt}$ of $T$ such that $T^{bwt}[i]$ is the character preceding the $i$-th
lexicographically smallest suffix of $T$. The FM-index finds the number of occurrences
of $P$ in $T$ by running the following algorithm [2]:

Algorithm FM_Search(P, T^bwt)
    i = m; sp = 1; ep = n;
    while ((sp ≤ ep) and (i ≥ 1)) do
        c = P[i];
        sp = C[c] + Occ(T^bwt, c, sp - 1) + 1;
        ep = C[c] + Occ(T^bwt, c, ep);
        i = i - 1;
    if (ep < sp) then return "not found" else return "found (ep - sp + 1) occs".



The index is actually formed by array $C[\cdot]$, such that $C[c]$ is the number
of characters smaller than $c$ in $T$, and function $Occ(T^{bwt}, \cdot, \cdot)$, such that
$Occ(T^{bwt}, c, i)$ is the number of occurrences of $c$ in $T^{bwt}[1 \ldots i]$. The exponential
alphabet dependence of the FM-index is incurred in the implementation of $Occ$
in constant time.
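
The backward search can be tried out with a short sketch. It is not the index itself: `bwt` sorts all rotations and `occ` counts by scanning, so nothing here is succinct or constant-time. The function names and the use of "$" as an end marker smaller than every text character are assumptions of the example.

```python
def bwt(text):
    """Burrows-Wheeler transform of text + '$' via sorted rotations (demo only)."""
    s = text + "$"                       # '$' assumed smaller than all characters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(pattern, tbwt):
    """Number of occurrences of pattern, following FM_Search with 1-based sp, ep."""
    n = len(tbwt)
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for ch in sorted(set(tbwt)):
        C[ch] = total
        total += tbwt.count(ch)

    def occ(c, i):
        # Occurrences of c in tbwt[1..i]; a naive O(n) scan stands in for Occ.
        return tbwt[:i].count(c)

    sp, ep = 1, n
    for c in reversed(pattern):          # process the pattern right to left
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:
            return 0
    return ep - sp + 1
```

For the text "abracadabra", `fm_count("abra", bwt("abracadabra"))` counts the two occurrences of "abra".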
Our idea is first to Huffman-compress $T$ so as to obtain $T'$, a binary string
of length $n' < n(H_0 + 1)$. Then, if we encode $P$ to $P'$ with the same codebook
used for $T$, it turns out that any occurrence of $P$ in $T$ is also an occurrence of
$P'$ in $T'$ (but not vice versa, as $P'$ may match in the middle of a code in $T'$).

We apply the BWT to $T'$ to obtain array $B = (T')^{bwt}$ of $n'$ bits. Another
array $Bh$ signals which bits of $B$ correspond to the beginning of codewords in $T'$. If
we apply algorithm FM_Search($P'$, $B$), the result is the number of occurrences
of $P'$ in $T'$. Moreover, the algorithm yields the range $[sp, ep]$ of occurrences in
$B$. The real occurrences of $P$ in $T$ correspond to the bits set in $Bh[sp \ldots ep]$.

Function $rank(Bh, i)$, which tells how many bits are set in $Bh[1 \ldots i]$, can
be implemented in constant time by storing $o(n')$ bits in addition to $Bh$ [4]. So
our number of occurrences is $rank(Bh, ep) - rank(Bh, sp - 1)$.

The advantage over the original FM-index is that this time the text $T'$ is
binary and thus $Occ(B, 1, i) = rank(B, i)$ and $Occ(B, 0, i) = i - rank(B, i)$.
Hence we can implement $Occ$ in constant time using $o(n')$ additional bits,
independently of the alphabet size.
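
The reduction of $Occ$ to $rank$ on a binary sequence can be sketched as follows. The class name `BitVector` is hypothetical, and the prefix-sum table used here takes a full integer per position; it only imitates the constant-time interface, not the $o(n')$-bit structure of [4].

```python
from itertools import accumulate

class BitVector:
    """Bit sequence with constant-time rank, using 1-based positions."""

    def __init__(self, bits):
        # prefix[i] = number of 1-bits among the first i positions.
        self.prefix = [0] + list(accumulate(bits))

    def rank(self, i):
        """Number of bits set in positions 1..i."""
        return self.prefix[i]

    def occ(self, bit, i):
        """Occurrences of `bit` (0 or 1) in positions 1..i."""
        return self.rank(i) if bit == 1 else i - self.rank(i)

    def count_set(self, sp, ep):
        """Bits set in positions sp..ep, i.e. rank(ep) - rank(sp - 1)."""
        return self.rank(ep) - self.rank(sp - 1)
```

The method `count_set` mirrors the way the number of real occurrences is read off $Bh[sp \ldots ep]$.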

Overall we need $2n(H_0 + 1)(1 + o(1))$ bits, and can search for $P$ in
$O(m(H_0 + 1))$ time if $P$ distributes as $T$. By adding $(1 + \varepsilon)n$ bits, for any
$\varepsilon > 0$, we can find the text position of each occurrence in worst case time
$O((1/\varepsilon)(H_0 + 1) \log n)$, and display any text substring of length $L$ in
$O((1/\varepsilon)(H_0 + 1)(L + \log n))$ average time. By adding other $2n$ bits, we can ensure
that all $O(H_0 + 1)$ values become $O(\log \sigma)$ in the worst case times. For further
details and experimental results refer to [3].
We consider the problem of finding all approximate occurrences of a given string
$q$, with at most $k$ differences, in a finite database or dictionary of strings. The
strings can be e.g. natural language words, such as the vocabulary of some
document or set of documents. This has many important applications in both
off-line (indexed) and on-line string matching. More precisely, we have a universe
$U$ of strings, and a non-negative distance function $d : U \times U \to \mathbb{N}$. The distance
function is metric, if it satisfies (i) $d(x,y) = 0 \Leftrightarrow x = y$; (ii) $d(x,y) = d(y,x)$;
(iii) $d(x,y) \le d(x,z) + d(z,y)$. The last item is called the "triangular inequality",
and is the most important property in our case. Many useful distance functions
are known to be metric, in particular edit (Levenshtein) distance is metric, which
we will use for $d$.

Our dictionary $S$ is a finite subset of that universe, i.e. $S \subseteq U$. $S$ is
preprocessed in order to efficiently answer range queries. Given a query string $q$,
we retrieve all strings in $S$ that are close enough to $q$, i.e. we retrieve the set
$\{u \in S \mid d(q,u) \le k\}$ for some $k$.

To solve the problem, we build a metric index over the dictionary, and use
the triangular inequality to efficiently prune the search. This is not a new idea;
a huge number of different indexes have been proposed over the years, see [2] for a
recent survey. An example of such an index is the Burkhard-Keller tree (BKT) [1],
which builds a hierarchy as follows. Some arbitrary string (called pivot) $p \in S$ is chosen
for the root of the tree. The child number $e$ is recursively built using the set
$S_e = \{u \in S \setminus \{p\} \mid d(p,u) = e\}$. This is repeated until there are only one, or in
general $b$ (for a bucket), strings left, which are stored into the leaves of the tree.
The tree has $O(n)$ nodes, where $n = |S|$, and the construction requires $O(n \log n)$
distance computations on average. The search with the query string $q$ and range
$k$ first evaluates the distance $d(q,p)$, where $p$ is the string in the root of the tree.
If $d(q,p) \le k$, then $p$ is put into the output list. The search then recursively
enters into each child $e$ such that $d(q,p) - k \le e \le d(q,p) + k$. Whenever the
search reaches a leaf, the stored bucket of strings is directly compared against $q$.
The search requires $O(n^{\alpha})$ distance computations on average, where $0 < \alpha < 1$.
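
A Burkhard-Keller tree with bucket size $b = 1$ can be sketched in a few lines. The class name `BKTree`, the incremental insertion (instead of recursive construction from the sets $S_e$) and the returned evaluation counter are choices made for this example; the pruning rule is the one stated above.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is a pair (pivot, {distance e: child node})."""

    def __init__(self, words, dist=levenshtein):
        self.dist = dist
        self.root = None
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            pivot, children = node
            e = self.dist(word, pivot)
            if e == 0:
                return                      # word already stored
            if e not in children:
                children[e] = (word, {})
                return
            node = children[e]

    def range_query(self, q, k):
        """Return (sorted matches within distance k, number of distance evaluations)."""
        out, evals = [], 0
        stack = [self.root] if self.root else []
        while stack:
            pivot, children = stack.pop()
            dq = self.dist(q, pivot)
            evals += 1
            if dq <= k:
                out.append(pivot)
            # Triangular inequality: only children with dq - k <= e <= dq + k can match.
            stack.extend(child for e, child in children.items() if dq - k <= e <= dq + k)
        return sorted(out), evals
```

The evaluation counter makes it easy to compare the number of distance computations against a linear scan of the dictionary.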

Another example is the Approximating Eliminating Search Algorithm (AESA)
[4], which is an extreme case of pivot based algorithms. This time there is no
hierarchy; the data structure is simply a precomputed matrix of all the
$n(n-1)/2$ distances between the $n$ strings in $S$. The space complexity is therefore
$O(n^2)$ and the matrix is computed with $O(n^2)$ edit distance computations. This
makes the structure highly impractical for large $n$. The benefit comes from the search
time: empirical results have shown that it needs only a constant number of
distance evaluations on average. However, each distance evaluation takes $O(n)$
"extra CPU time".

The problem with AESA is its high preprocessing and space complexities. For
small dictionaries this is not a problem, so we propose using AESA to implement
an additional search algorithm for the buckets of $b$ strings stored into the leaves of
tree based indexes such as BKT. This means that the space complexity
becomes $O(nb)$, and the construction time $O(n(b + \log(n/b)))$. In effect this
makes the index memory adaptive. We can adjust $b$ to make good use of the
available memory to reduce the number of distance computations. We call the
resulting algorithm ABKT. Another way to trade space for time is to collapse
children $d(q,p) - k \le e \le d(q,p) + k$ into a single branch, and at the search time
enter only into child $(d(q,p), k)$. This can be done only for levels up to $\ell$ of the
tree to keep the memory requirements low. We call this algorithm E(BP)BKT.

The recent bit-parallel on-line string matching algorithm in [3] can be easily
modified to compute several edit distances in parallel for short strings, i.e. we
can compute the edit distance between $q$ and $r$ other strings, each of length $m$,
in time $O(|q|)$, where $r = \lfloor w/m \rfloor$ and $w$ is the number of bits in a computer word
(typically 32 or 64, or even 128 with the SIMD extensions of recent processors).
The simplest application of this technique in BKT is to store a bucket of $r$ strings
into each node, instead of only one (the pivot), and use one of them as the pivot
string for building the hierarchy and guiding the search. In the preprocessing
phase the effect is that the tree has only $O(n/r)$ nodes (assuming $b = 1$). At the
search time, we evaluate the distance between the query string and the pivot
as before, but at the same time, without any additional cost, we evaluate $r - 1$
other distances. For these $r - 1$ other distances we just check if they are close
enough to the query (this can be done in parallel in $O(1)$ time), but do not use
them for any other purpose. We call this algorithm BPBKT.

We have implemented the algorithms in C/C++ and run experiments on a
2 GHz Pentium 4. We used a dictionary of 98580 English words for the
experiments. We selected 10,000 query words from the dictionary. For ABKT we used
$b = 1000$, for BPBKT $r = 8$ (and $w = 128$) and for EBPBKT $\ell = 1$. The
average number of distance evaluations / total query time in seconds for $k = 1$
were 2387 / 20.58 (BKT), 495 / 14.74 (ABKT), 729 / 8.93 (BPBKT) and 583 /
7.09 (EBPBKT). The ratio between the performance of the algorithms remained
approximately the same for $k = 1..4$.
1 Introduction

The String B-tree [2] due to Ferragina and Grossi is a well-known external- 
memory index data structure which handles arbitrarily long strings and performs 
search efficiently. It is essentially a combination of B+-trees and Patricia tries. 
From a high-level point of view, the String B-tree of a string T of length N is 
a B+-tree, where the keys are pointers to the suffixes of string T, and they are
sorted in lexicographically increasing order of the suffixes. A Patricia trie is used 
for each node of the String B-tree. By plugging in Patricia tries at nodes, the 
branch/search/update operations can be carried out efficiently. Due to Patricia 
tries, however, the String B-tree is rather heavy and complex. 

In this paper we propose a new implementation of the String B-tree, which
is simpler and easier to implement than the original String B-tree, and which
supports search as efficiently as the original String B-tree. Instead of a Patricia
trie, each node contains an array, $lcp_i$, of integers and an array, $lnc_i$, of characters.
Once the number of keys in a node is given, arrays $lcp_i$ and $lnc_i$ occupy a fixed
space, while the space required for a Patricia trie can vary within a constant
factor. Because arrays are simple and occupy a fixed space, they are easy to
handle and suitable for external-memory data structures.

We present an efficient branching algorithm at a node that uses only the two
arrays $lcp_i$ and $lnc_i$. Ferguson [1] gave a branching algorithm for binary strings.
We extend this algorithm so as to do the branch operation efficiently for strings
over a general alphabet. The search algorithm based on our branching algorithm
requires $O(\log_B N + (M + occ)/B)$ disk accesses as in the original String B-tree, where
$M$ is the length of a pattern, $occ$ is the number of occurrences, and $B$ is the disk
page size. The branching algorithm can also be used as a basic algorithm of the
insertion, deletion, and construction algorithms of the original String B-tree.

2 Branching Algorithm 

Let $T$ be a string of length $N$ over an alphabet $\Sigma$, which is stored in external
memory. We denote the $i$th character of string $T$ by $T[i]$. We define the $i$th suffix
as the substring $T[i]T[i+1] \cdots T[N]$ and call the index $i$ a suffix pointer. Given
two strings $\alpha$ and $\beta$, $\alpha \prec \beta$ if $\alpha$ is lexicographically smaller than $\beta$, and
$\alpha \preceq \beta$ if $\alpha$ is lexicographically smaller than or equal to $\beta$. We denote by
$lcp(\alpha, \beta)$ the length of the longest common prefix (lcp) of $\alpha$ and $\beta$.

* Work supported by IMT 2000 Project AB02.

We explain additional data stored in nodes. Let $s_0, s_1, \ldots, s_n$ be suffixes
represented by suffix pointers stored in a node $v$, where $s_0 \prec s_1 \prec \cdots \prec s_n$. For
efficient branching, we store additional data $lcp_i$ and $lnc_i$ in node $v$, which are
defined as follows for $1 \le i \le n$:

- $lcp_i = lcp(s_{i-1}, s_i)$, and

- $lnc_i = s_i[lcp_i + 1]$.
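
The two arrays are straightforward to compute from the sorted suffixes of a node, as the following sketch shows. The function names are hypothetical, the suffixes are given as plain strings rather than suffix pointers, and Python's 0-based indexing means that `cur[l]` is the character written $s_i[lcp_i + 1]$ above.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length

def node_arrays(suffixes):
    """Given s_0 < s_1 < ... < s_n, return (lcp, lnc) lists indexed 1..n.

    Index 0 is unused (None) so that list positions match the text.
    """
    lcps, lncs = [None], [None]
    for prev, cur in zip(suffixes, suffixes[1:]):
        l = lcp(prev, cur)
        lcps.append(l)
        lncs.append(cur[l])      # first character of s_i after the common prefix
    return lcps, lncs
```

Because the suffixes are strictly increasing, $s_i$ is always longer than its common prefix with $s_{i-1}$, so the character `cur[l]` exists.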

Given a pattern $P$ of length $M$, the branching algorithm finds an index $j$ such
that $s_{j-1} \prec P \preceq s_j$ at node $v$. Let $L$ be the value of $\max(lcp(P, s_0), lcp(P, s_n))$.
At node $v$, the algorithm requires that:

C1. $s_0 \prec P \prec s_n$, and

C2. $L$ is given as an input parameter and $L < M$.

When the branching algorithm is used as a basic operation of the search in the
String B-tree, the above conditions are satisfied. We omit the details. For
simplicity, we define $P[i]$ as an empty character for $i > M$, which is lexicographically
smaller than any other character in $\Sigma$.

The algorithm consists of the following three stages. 

Stage A: Find the suffix $s_x$ using $lcp_i$ and $lnc_i$ such that $s_x$ is one of the suffixes
that share the longest lcp with $P$ among the suffixes stored in node $v$.

Starting with $i = 1$, we scan arrays $lcp_i$ and $lnc_i$ from left to right and
maintain $x$ as the desired index inductively. We initialize $x$ to 0. At step $i$,
we compare $P[l+1]$ with $lnc_i$ (i.e., $s_i[l+1]$), where $l = lcp_i$. If $lnc_i \le P[l+1]$,
then we set $x$ to $i$ and increase $i$ by one. Otherwise, we increase $i$ until
$lcp_i < l$. We repeat this process until $i$ reaches $n$. Then, $s_x$ is the desired
suffix.

Stage B: Find the value of $lcp(P, s_x)$.

We load disk pages where $s_x$ is stored and compare $P$ with $s_x$
character-by-character from the $(L+1)$st character. As a result, we get the value of
$lcp(P, s_x)$. Let $L'$ be the value of $lcp(P, s_x)$. In this stage, we access $O(\ldots)$
disk pages.

Stage C: Find the index $j$ such that $s_{j-1} \prec P \preceq s_j$.

If $P \prec s_x$, then we decrease $i$ until $lcp_i < L'$, starting with $i = x$. If $P \succ s_x$,
then we increase $i$ until $lcp_i < L'$, starting with $i = x + 1$. Then, the value
of $i$ is the desired index $j$.
One degree of freedom not usually exploited in developing high-performance
text-processing algorithms is the encoding of the underlying atomic character 
set. Here we consider a text compression method where the specific charac- 
ter set collating-sequence employed in encoding the text has a big impact on 
performance. We demonstrate that permuting the standard character collating- 
sequences yields a small win on Asian-language texts over gzip. We also show im- 
proved compression with our method for English texts, although not by enough 
to beat standard methods. However, we also design a class of artificial languages 
on which our method clearly beats gzip, often by an order of magnitude. 

1 Differential Encoding 

Differential coding is a common preprocessing step for compressing numerical
data associated with sampled signals and other time series streams. The temporal
coherence of such signals implies that the value at time $t_i$ likely differs little from
that at $t_{i+1}$. Thus representing the signal as an initial value followed by a stream of
differences (i.e. $t_{i+1} - t_i$ for $0 < i < n$) should consist primarily of small differences.
Such streams should be more compressible using standard techniques like
run-length encoding, Huffman coding, and gzip than the original data stream.
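
A minimal sketch of differential coding on a numerical sequence follows. The function names `diff_encode` and `diff_decode` are assumptions of the example, and no entropy coder is applied; the sketch only shows the transformation and its inverse.

```python
from itertools import accumulate

def diff_encode(values):
    """Initial value followed by the stream of successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def diff_decode(deltas):
    """Inverse of diff_encode: running sums restore the original values."""
    return list(accumulate(deltas))
```

On a slowly varying signal such as `[10, 11, 11, 13, 12]` the encoded stream is `[10, 1, 0, 2, -1]`, which is dominated by small values.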

The most relevant previous work is [1], where alphabet permutation was 
employed to improve the performance of compression algorithms based on the 
Burrows- Wheeler transform. 

2 Experiments on English Texts 

The key to successful differential encoding lies in identifying the best collating
sequence. We seek the circular $n$-permutation $\pi$ which minimizes

$$\sum_{i=1}^{n} \sum_{j=1}^{n} p(i,j) \, d(\pi(i), \pi(j))$$

where $p(i,j)$ is the probability that symbol $j$ immediately follows symbol $i$, i.e.
$p(i,j) = P(j \mid i)$, and $d(i,j)$ is the shortest "distance" from $i$ to $j$ around the
circular permutation. Thus $d(i,j) = \min(|j - i|, n - |j - i|)$.
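
The circular distance and the cost of a candidate collating sequence can be evaluated with a short sketch. It assumes that the objective is the sum of $p(i,j)$ weighted by the circular distance between the positions of the two symbols; the function names and the dictionary format of `p` are hypothetical.

```python
def circular_distance(i, j, n):
    """Shortest distance between positions i and j on a cycle of n positions."""
    return min(abs(j - i), n - abs(j - i))

def permutation_cost(order, p):
    """Cost of a collating sequence under the assumed objective.

    order -- symbols listed in collating order (treated as circular)
    p     -- mapping (a, b) -> probability that symbol b follows symbol a
    """
    n = len(order)
    position = {symbol: index for index, symbol in enumerate(order)}
    return sum(prob * circular_distance(position[a], position[b], n)
               for (a, b), prob in p.items())
```

A search procedure would then compare `permutation_cost` over candidate orders and keep the cheapest one.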
To estimate the conditional character-probabilities for the optimized collating
sequence for English text, we used letter-pair (bigram) frequencies derived from 
a large corpus of text including the famous Brown corpus. The Discropt system 
was run for 10 hours optimizing the permutation over these frequencies, resulting 
in the following collating sequence: 

.VGWCDINHET’ ’SAROLFMPUYBJQZXK 

We compared differential compression using both the standard and opti- 
mized collating sequence, with both standard Huffman codes and gzip employed 
for encoding. The permuted collating sequence typically reduces the size of the 
Huffman-encoded differential sequences by 3-4%, and gzip-encoded differential 
sequences by about 1% - however, both encoding algorithms work substantially 
better on the original text instead of the differential text. 

3 Experiments on Asian-Language Texts 

We reasoned that differential encoding might perform better on Asian-language 
texts, because the larger size of the alphabet makes such texts more closely re- 
semble quantized signals. We experimented on Chinese, Japanese, and Korean 
UNICODE texts with both 8-bit and 16-bit recoded alphabets. The 8-bit alpha- 
bet permutation produced worse results than the original alphabet encoding for 
both gzip and Huffman codes, but permuting the full 16-bit alphabet encoding 
did permit the differential gzip encoding to beat the conventional gzip encodings 
by 1-2% on almost all files. 

4 Experiments on Martian-Language Texts 

To demonstrate that gzip can be significantly beaten via differential encoding,
we define a class of artificial languages which we will call Martian. Martian
words evolve in families. Each family is defined by a length-$(\ell - 1)$ sequence of
differences from $0$ to $\sigma - 1$, where $\sigma = |\Sigma|$ for alphabet $\Sigma$. There are $\sigma$ distinct
length-$\ell$ words in each family, formed by prepending each $\alpha \in \Sigma$ to the difference
sequence. For example, for $\Sigma = \{a, \ldots, z\}$ the family $(+2, +3, -6)$ defines the
words acfz, bdga, cehb, and so forth.
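
The construction of a family can be reproduced with a small sketch. The function name `martian_family` is hypothetical, and differences are applied cyclically (modulo the alphabet size), so that $-6$ and $+20$ denote the same step over a 26-letter alphabet.

```python
import string

def martian_family(differences, alphabet=string.ascii_lowercase):
    """All words of one family: every start symbol followed by the same differences."""
    sigma = len(alphabet)
    words = []
    for start in range(sigma):
        position, letters = start, [alphabet[start]]
        for diff in differences:
            position = (position + diff) % sigma   # wrap around the alphabet
            letters.append(alphabet[position])
        words.append("".join(letters))
    return words
```

Calling `martian_family([2, 3, -6])` lists the 26 four-letter words of the family used in the example above, starting with acfz, bdga and cehb.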

We achieve our greatest improvement in differentially encoding relatively
short Martian texts drawn from large families of long words. We demonstrated
that differentially encoded gzip results in 5.8 times better compression than
plain-text gzip on files from 2500 to 50,000 words for 20 families of 20-character words.
Even more extreme performance is obtainable by further lengthening the words.
Abstract. A space-efficient linear-time approximation algorithm for the
grammar-based compression problem, which asks, for a given string,
to find a smallest context-free grammar deriving the string, is presented.
The algorithm consumes only $O(g^* \log g^*)$ space and achieves the
worst-case approximation ratio $O(\log g^* \log n)$, with the size $n$ of an input and
the optimum grammar size $g^*$. Experimental results for typical
benchmarks demonstrate that our algorithm is practical and efficient.



1 Introduction 

The grammar-based compression problem is to find a smallest context-free
grammar that generates only the given string. Such a CFG requires that every
nonterminal is derived from only one production rule, i.e., that the grammar is
deterministic. The problem is deeply related to factoring problems for strings, and the
complexity of similar minimization problems has been rigorously studied. For example,
Storer [20] introduced a factorization for a given string and showed the problem is NP-hard.
De Agostino and Storer [2] defined several online variants and proved that those
are also NP-hard.
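
A deterministic CFG of this kind can be represented as a mapping with exactly one right-hand side per nonterminal, as in the following sketch. The function names, the example rules and the size measure (total length of all right-hand sides) are assumptions made for illustration.

```python
def expand(rules, symbol):
    """Derive the unique string generated from `symbol`.

    rules -- mapping nonterminal -> list of symbols (one rule per nonterminal);
             any symbol that is not a key is treated as a terminal.
    """
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])

def grammar_size(rules):
    """Total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())

# Hypothetical grammar in Chomsky normal form style for the string "abababab".
example_rules = {"S": ["B", "B"], "B": ["A", "A"], "A": ["a", "b"]}
```

Here `expand(example_rules, "S")` derives the 8-character string from a grammar of size 6; the gap widens as the string becomes more repetitive.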

As non-approximability results, Lehman and Shelat [12] showed that the
problem is APX-hard, i.e. it is hard to approximate this problem within a
constant factor (see [1] for definitions). They also mentioned its interesting
connection to the semi-numerical problem [9], which is an algebraic problem of
minimizing the number of different multiplications to compute the given integers
and has no known polynomial-time approximation algorithm achieving a ratio
$o(\log n / \log \log n)$. Since the problem is a special case of the grammar-based
compression, an approximation better than this ratio seems to be computationally
hard.

On the other hand, various practical algorithms for the grammar-based
compression have been devised so far. LZ78 [24], including LZW [21], and
BISECTION [8] are considered as algorithms that compute straight-line programs,
CFGs formed from Chomsky normal form formulas. Algorithms for
restricted CFGs have also been presented in [6, 10, 14, 15, 22]. Lehman and Shelat [12]
proved the upper bounds of the approximation ratio of these practical
algorithms, as well as the lower bounds with the worst-case instances. For
example, the BISECTION algorithm achieves an approximation ratio no more than
$O((n/\log n)^{1/2})$. All those ratios, including the lower bounds, are larger than
$O(\log n)$.

Recently polynomial-time approximation algorithms for the grammar-based
compression problem have been widely studied and the worst-case approximation
ratio has been improved. The first $\log n$-approximation algorithm was developed
by Charikar et al. [3]. Their algorithm guarantees the ratio $O(\log(n/g^*))$, where
$g^*$ is the size of a minimum deterministic CFG for an input. Independently,
Rytter presented in [16] another $O(\log(n/g^*))$-approximation algorithm that
employs a suffix tree and the LZ-factorization technique for strings. Sakamoto also
proposed in [17] a simple linear-time algorithm based on Re-pair [10] and
achieving ratio $O(\log n)$; this ratio has now been improved to $O(\log(n/g^*))$.

The ratio $O(\log(n/g^*))$ achieved by these new algorithms is theoretically
sufficiently small. However, all these algorithms require $O(n)$ space, which
prevents us from applying the algorithms to huge texts, and handling huge texts is
crucial to obtain a good compression ratio in practice. For example, the algorithm
Re-pair [10] spends $5n + \ldots$ space on unit-cost RAM with the input size $n$.

This situation motivates us to develop a linear-time, sub-linear space
O(log n)-approximation algorithm for grammar-based compression. We present a
simple algorithm that repeatedly substitutes one new nonterminal symbol for all
identical, non-overlapping pairs of contiguous symbols occurring in the string.
This is carried out by utilizing the idea of the lowest common ancestor in
balanced binary trees, and no special data structure, such as a suffix tree or
an occurrence frequency table, is required. In consequence, the space complexity
of our algorithm is nearly equal to the total number of created nonterminal
symbols, each of which corresponds to a production rule in Chomsky normal form.

The size of the final dictionary of the rules is bounded by means of the
LZ-factorization and its compactness [16]. Our algorithm runs in linear time
with O(g* log g*) space, and guarantees the worst-case approximation ratio
O(log g* log n) on the unit-cost RAM model. The memory space is devoted to the
dictionary that maps a contiguous pair of symbols to a nonterminal. Practically,
in a randomized model, the space complexity can be reduced to O(g* log g*) by
using a hash table for the dictionary. In the framework of dictionary-based
compression, the lower bound on memory space is usually estimated by the size of
the smallest possible dictionary, and thus our algorithm is nearly optimal in
space complexity. Compared to other practical dictionary-based compression
algorithms, such as LZ78, which achieves the ratio Ω(n^(2/3)/log n), the lower
bound on memory space of our algorithm is considered to be sufficiently small.
We confirm the practical efficiency of our algorithm by computational
experiments on several benchmark texts.

The remaining part of this paper is organized as follows. In Section 2, we
prepare the definitions related to grammar-based compression. In Section 3,
we introduce the notion of lowest common ancestors in a complete binary tree
defined by alphabet symbols. Using this notion, our algorithm decides a fixed
priority of all pairs appearing in a current string and replaces them according
to the priority. The algorithm is presented in Section 4, where we also analyze
the approximation ratio and estimate the space efficiency. In Section 5, we show
the experimental results obtained by applying our algorithm to typical
benchmarks. In Section 6, we summarize this study.

2 Notions and Definitions 

We assume a finite alphabet for the symbols forming input strings throughout
this paper. Let Σ be a finite alphabet. The set of all strings over Σ is denoted
by Σ*, and Σ^i denotes the set of all strings of length i. The length of a string
w ∈ Σ* is denoted by |w|, and for a set S, the notation |S| refers to the size
(cardinality) of S. The ith symbol of w is denoted by w[i]. For an interval [i,j]
with 1 ≤ i ≤ j ≤ |w|, the substring of w from w[i] to w[j] is denoted by w[i,j].

A repetition is a string x^k for some x ∈ Σ and some positive integer k.
A repetition w[i,j] in w of a symbol x ∈ Σ is maximal if w[i−1] ≠ x and
w[j+1] ≠ x. It is simply referred to as x^+ if there is no ambiguity in its
interval in w. Intervals [i,j] and [i',j'] with i < i' are overlapping if
i' ≤ j < j', and are independent if j < i'. A pair u ∈ Σ^2 is a string of length
two, and an interval [i,i+1] is a segment of u in w if w[i,i+1] = u. Two segments
[i−1,i] and [i+1,i+2] are said to be the left segment and the right segment of
[i,i+1], respectively.

A context-free grammar (CFG) is a quadruple G = (Σ, N, P, s) of disjoint
finite alphabets Σ and N, a finite set P ⊆ N × (N ∪ Σ)* of production rules, and
the start symbol s ∈ N. Symbols in N are called nonterminals. A production
rule a → b_1 ⋯ b_k in P derives β ∈ (Σ ∪ N)* from α ∈ (Σ ∪ N)* by replacing
an occurrence of a ∈ N in α with b_1 ⋯ b_k. In this paper, we assume that any
CFG is deterministic, that is, for each nonterminal a ∈ N, exactly one production
rule from a is in P. Thus, the language L(G) defined by G is a singleton set. We
say a CFG G derives w ∈ Σ* if L(G) = {w}. The size of G is the total length
of strings in the right-hand sides of all production rules, and is denoted by |G|.
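
The definitions above can be illustrated with a small sketch. It assumes a deterministic CFG is stored as a Python dictionary from each nonterminal to the list of symbols on its right-hand side; the names `expand`, `grammar_size` and the example rules are hypothetical and not part of the paper.

```python
def expand(symbol, rules, cache=None):
    """Return the unique string derived from `symbol` by a deterministic CFG."""
    if cache is None:
        cache = {}
    if symbol not in rules:              # a terminal derives itself
        return symbol
    if symbol not in cache:              # memoize so shared rules expand once
        cache[symbol] = "".join(expand(s, rules, cache) for s in rules[symbol])
    return cache[symbol]


def grammar_size(rules):
    """|G|: total length of the right-hand sides of all production rules."""
    return sum(len(rhs) for rhs in rules.values())


# Example grammar in Chomsky normal form (illustrative only).
rules = {"S": ["A", "A"], "A": ["B", "B"], "B": ["a", "b"]}
derived = expand("S", rules)
size = grammar_size(rules)
```

Here the three rules derive a string of length eight, which is the kind of saving grammar-based compression aims for.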

The aim of grammar-based compression is formalized as a combinatorial 
optimization problem, as follows: 

Problem 1 Grammar-Based Compression
Instance: A string w ∈ Σ*.

Solution: A deterministic CFG G that derives w.

Measure: The size of G.

From now on, we assume that every deterministic CFG is in Chomsky normal
form, i.e. the size of strings in the right-hand side of production rules is two, and
we use |N| for the size of a CFG. Note that for any CFG G there is an equivalent
CFG G' in Chomsky normal form whose size is no more than 2·|G|.

It is known that there is an important relation between a deterministic CFG
and the factorization. The LZ-factorization LZ(w) of w is the decomposition of
w into f_1 ⋯ f_k, where f_1 = w[1], and for each 1 < i ≤ k, f_i is the longest
prefix of the suffix w[|f_1 ⋯ f_{i−1}| + 1, |w|] that appears in f_1 ⋯ f_{i−1}.
Each f_i is called a factor. The size |LZ(w)| of LZ(w) is the number of its
factors. The following result is used in the analysis of the approximation ratio
of our algorithm.

Theorem 1 ([16]). For any string w and its deterministic CFG G, the inequality
|LZ(w)| ≤ |G| holds.
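
The LZ-factorization defined above can be sketched directly from its definition. This is a quadratic-time illustration only, not the suffix-tree method used in [16]; the function name `lz_factorize` is hypothetical, and the sketch assumes that a symbol that has not occurred before forms a factor of length one.

```python
def lz_factorize(w):
    """Naive LZ-factorization: each factor is the longest prefix of the
    remaining suffix that already appears in the part factorized so far."""
    factors = []
    pos = 0
    while pos < len(w):
        seen = w[:pos]                   # f_1 ... f_{i-1}
        length = 0
        # Grow the candidate factor while it still occurs inside `seen`.
        while pos + length < len(w) and w[pos:pos + length + 1] in seen:
            length += 1
        if length == 0:                  # assumption: fresh symbol, length 1
            length = 1
        factors.append(w[pos:pos + length])
        pos += length
    return factors


example_factors = lz_factorize("abababab")
```

For this input the sketch produces four factors, so by Theorem 1 any deterministic CFG deriving the string has size at least four.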

3 Compression by the Alphabetical Order 

In this section we describe the central idea of our grammar-based compression 
utilizing information only available from individual symbols. The aim is to min- 
imize the number of different nonterminals generated by our algorithm. 

A replacement [i, i+1] → a for w is an operation that replaces a pair w[i, i+1]
with a nonterminal a ∈ N. A set R of replacements is, by assuming some order
on R, regarded as an operation that performs a series of replacements on w. In
the following we introduce a definition of a set of replacements whose effect on
a string is independent of the order.

Definition 1. A set R of replacements for w is appropriate if it satisfies the
following: (1) at most one of two overlapping segments [i, i+1] and [i+1, i+2]
is replaced by replacements in R, (2) at least one of three overlapping segments
[i, i+1], [i+1, i+2] and [i+2, i+3] is replaced by replacements in R, and (3)
for any pair of replacements [i, i+1] → a and [j, j+1] → b in R, a = b if and
only if w[i, i+1] = w[j, j+1].

Clearly, for any string w, an appropriate replacement R for w generates the 
string w' uniquely. In such a case, we say that R generates w' from w, and write 
w' = R{w). Now we consider the following problem: 

Problem 2 Minimum Appropriate Replacement 
Instance: A string w. 

Solution: An appropriate replacement R for w. 

Measure: The number of kinds of nonterminals newly introduced by R. 
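
As a reading aid, the three conditions of Definition 1 can be checked mechanically. The sketch below assumes 0-based positions and represents a set of replacements as a dictionary from the start position of a segment to the nonterminal that replaces it; the name `is_appropriate` and this representation are illustrative assumptions.

```python
def is_appropriate(w, repl):
    """Check conditions (1)-(3) of Definition 1 for replacements `repl`,
    given as {start position of a pair in w: nonterminal name}."""
    n = len(w)
    starts = set(repl)
    if any(not (0 <= i < n - 1) for i in starts):
        return False                     # every segment must lie inside w
    # (1) two overlapping segments are never both replaced
    if any(i + 1 in starts for i in starts):
        return False
    # (2) of any three consecutive overlapping segments, one is replaced
    for i in range(n - 3):
        if not ({i, i + 1, i + 2} & starts):
            return False
    # (3) same nonterminal if and only if same pair
    pair_to_nt, nt_to_pair = {}, {}
    for i, nt in repl.items():
        pair = w[i:i + 2]
        if pair_to_nt.setdefault(pair, nt) != nt:
            return False
        if nt_to_pair.setdefault(nt, pair) != pair:
            return False
    return True
```

The measure of Problem 2 is then simply the number of distinct values in `repl`.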

Here we explain the strategies for making pairs in our algorithm. Let d be a
positive integer, and let k be ⌈log_2 d⌉. An alphabet tree T_d for
Σ = {a_1, ..., a_d} is the rooted, ordered complete binary tree whose leaves are
labeled with 1, ..., 2^k from left to right. The height of an internal node
refers to the number of edges of a path from the node to a descendant leaf. Let
lca(i,j)_d denote the height of the lowest common ancestor of the leaves i and j.
For simplicity, we omit the index d and use lca(i,j) if there is no ambiguity.
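
A minimal sketch of how lca(i,j) can be computed, assuming leaves are numbered 1, ..., 2^k as above: the height of the lowest common ancestor is the position of the highest bit in which the 0-based leaf numbers differ. The function name `lca_height` is hypothetical.

```python
def lca_height(i, j):
    """Height of the lowest common ancestor of leaves i and j (1-based)
    in a complete binary tree, computed with a single xor."""
    return ((i - 1) ^ (j - 1)).bit_length()


heights = [lca_height(1, 2), lca_height(2, 3), lca_height(4, 5), lca_height(5, 5)]
```

Sibling leaves such as 1 and 2 meet at height 1, while leaves 4 and 5 of an eight-leaf tree only meet at the root, at height 3.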

Definition 2. Let Σ be a finite alphabet with a fixed order. A string α ∈ Σ*
is increasing if the symbols in α are in increasing order, and is decreasing if
the symbols are in decreasing order with respect to Σ. A string is monotonic if
it is either increasing or decreasing.

By using the above notion, we factorize a string w ∈ Σ* into the sequence
w_1, ..., w_n of monotonic strings uniquely, as follows: w_1 is the longest
monotonic prefix of w, and if w_1, ..., w_i are decided, then w_{i+1} is the
longest monotonic prefix of the string w' = w[|w_1 ⋯ w_i| + 1, |w|]. The
sequence w_1, ..., w_n is called the Σ-factoring of w.

Definition 3. Let w_1, ..., w_n be the Σ-factoring of w. A pair ab is a boundary
pair between w_j and w_{j+1} if a is the rightmost symbol of w_j and b is the
leftmost symbol of w_{j+1}.
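
The Σ-factoring and the boundary pairs of Definition 3 can be sketched as follows. The sketch assumes that "increasing" and "decreasing" are strict and that the order on symbols is Python's character order; the function names are illustrative.

```python
def monotonic_factoring(w):
    """Greedy factoring of w into longest strictly monotonic prefixes."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        j = i + 1
        if j < n and w[j] != w[i]:
            rising = w[j] > w[i]         # direction fixed by the first two symbols
            j += 1
            while j < n and w[j] != w[j - 1] and (w[j] > w[j - 1]) == rising:
                j += 1
        factors.append(w[i:j])
        i = j
    return factors


def boundary_pairs(factors):
    """Pairs formed by the last symbol of one factor and the first of the next."""
    return [left[-1] + right[0] for left, right in zip(factors, factors[1:])]


example = monotonic_factoring("abcbad")
pairs = boundary_pairs(example)
```

On the input "abcbad" the factors are an increasing run, a decreasing run and a single trailing symbol, with one boundary pair between each two neighbours.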

The first idea for Minimum Appropriate Replacement is to replace all
the boundary pairs. Let w be a string over an alphabet Σ' and α a substring in
w appearing at least twice. If α is longer than 2|Σ'|, then it contains at least
two boundary pairs. Let b_L and b_R be the leftmost and the rightmost boundary
pairs in α, respectively, and let [i, i+1] and [j, j+1] be the corresponding
segments of b_L and b_R. Then we can write α = X · α[i, i+1] · Y · α[j, j+1] · Z
with strings X, Y and Z. Let R be an appropriate replacement that replaces all
the boundary pairs (and the other remaining pairs, for example by a left-to-right
scheme) in Y. In any occurrence of α, the substring α[i, i+1] · Y · α[j, j+1] is
uniquely replaced by R. Thus, for any two occurrences of α, the differences of
their replacement by R occur only in X and Z. Notice that |X| and |Z| are bounded
by the current alphabet size k = |Σ'|. Next, we reduce the length of such X and Z
to O(log k) by another technique.

Definition 4. Let w be a string in Σ*, and let w[i−1, i+2] = a_{j1} a_{j2} a_{j3} a_{j4}
be a monotonic substring of w with a_{j1}, a_{j2}, a_{j3}, a_{j4} ∈ Σ. If
lca(j1, j2) < lca(j2, j3) and lca(j2, j3) > lca(j3, j4), then the pair w[i, i+1]
is called a locally maximum pair.
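
Definition 4 can be turned into a short test on four consecutive symbols. The sketch below assumes the string is given as a list of 1-based leaf indices in the alphabet tree, uses 0-based positions in that list, and computes lca heights by xor; the names are hypothetical.

```python
def lca_height(i, j):
    """Height of the lowest common ancestor of leaves i and j (1-based)."""
    return ((i - 1) ^ (j - 1)).bit_length()


def locally_maximum_pairs(idx):
    """Return 0-based positions p such that (idx[p], idx[p+1]) is a locally
    maximum pair, i.e. the middle pair of a monotonic window of four symbols
    whose lca height is strictly larger than that of both neighbouring pairs."""
    found = []
    for p in range(1, len(idx) - 2):
        a, b, c, d = idx[p - 1:p + 3]
        monotonic = a < b < c < d or a > b > c > d
        if monotonic and lca_height(a, b) < lca_height(b, c) > lca_height(c, d):
            found.append(p)
    return found


positions = locally_maximum_pairs([1, 2, 3, 4, 5, 6, 7, 8])
```

In this increasing example the selected pairs start at positions 1, 3 and 5, so they never share a symbol.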

Our second idea is to replace all locally maximum pairs. Since any locally
maximum pair shares no symbol with other locally maximum pairs or with
boundary pairs, all boundary pairs and locally maximum pairs in w can be
included in an appropriate replacement R. Assume a monotonic substring having no
locally maximum pair. The length of such a string is O(log k), where k is the
current alphabet size and log k bounds the height of the alphabet tree, because
there are at most log k different values of lca(i,j). Thus, any two occurrences
of α are replaced by R with the same string, except for their prefixes and
suffixes of length at most O(log k). If a string consists of only short
Σ-factors, then there may be no locally maximum pairs in the string. Therefore,
not only locally maximum pairs but also the boundary pairs are necessary. In
the next section we describe the algorithm utilizing the ideas given above.

4 Algorithm and Analysis 

In this section we introduce an approximation algorithm for the grammar-based 
compression problem and analyze its approximation ratio to the optimum as 
well as its space efficiency. 



4.1 Algorithm LCA 

The algorithm LCA(w) is presented in Fig. 1. We describe the outline of LCA(w)
below.

 1. Algorithm LCA(w)
 2.   m = 0;
 3.   Initialize the mth dictionary D_m = ∅;
 4.   Replace all maximal repetitions w[i, i+j] by A_(a,j) and add A_(a,j) → B_(a,j) C_(a,j)
 5.     to D_m, where B_(a,j) C_(a,j) and their productions are recursively defined below;
 6.   For each i = 1, ..., |w| − 2 do:
 7.     If the pair w[i, i+1] is boundary or locally maximum, then
 8.       replace w[i, i+1] by a consistent A_k;
 9.       D_m ← D_m ∪ {'A_k → w[i, i+1]'};
10.     else
11.       replace w[i] and w[i+1, i+2] by consistent nonterminals A_k and A_{k+1};
12.       D_m ← D_m ∪ {'A_k → w[i]', 'A_{k+1} → w[i+1, i+2]'};
13.   Increment m;
14.   Goto 3. until all pairs in w are mutually different;
15.   Output D ∪ {S → w} for D = D_0 ∪ ⋯ ∪ D_m;

    B_(a,j) C_(a,j) =  A_(a,j/2) · A_(a,j/2),   if j ≥ 4 is even
                       A_(a,j−1) · a,           if j ≥ 3 is odd
                       a · a,                   otherwise

Fig. 1. The approximation algorithm for grammar-based compression. A segment
w[i, i+1] must be replaced by a nonterminal consistent with a current dictionary
D_m, i.e. w[i, i+1] is replaced by A if a production A → BC (BC = w[i, i+1]) is
already registered to D_m, and a new nonterminal is created to replace w[i, i+1]
otherwise.



Phase 1 (Lines 4-5): The algorithm finds all maximal repetitions and
replaces them with nonterminal symbols. As a result, a maximal repetition a^+
will be divided into two strings. This process continues until the length of any
repetition becomes two or less.
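
The recursive rule for repetitions given under Fig. 1 can be sketched as follows. This is an illustration under stated assumptions: nonterminals A_(a,j) are represented as Python tuples (a, j), terminals as one-character strings, and j ≥ 2; the function names are hypothetical.

```python
def repetition_rules(a, j, rules):
    """Add productions so that nonterminal (a, j) derives a repeated j times
    (j >= 2), following the case distinction of Fig. 1; return its name."""
    name = (a, j)
    if name not in rules:
        if j >= 4 and j % 2 == 0:        # even: split into two equal halves
            half = repetition_rules(a, j // 2, rules)
            rules[name] = (half, half)
        elif j >= 3:                     # odd: peel off one symbol
            rules[name] = (repetition_rules(a, j - 1, rules), a)
        else:                            # j == 2
            rules[name] = (a, a)
    return name


def derived_length(symbol, rules):
    """Length of the string derived from `symbol`."""
    if symbol not in rules:
        return 1
    return sum(derived_length(s, rules) for s in rules[symbol])


rules = {}
top = repetition_rules("a", 7, rules)
```

Because the exponent is at least halved every two steps, a repetition of length j needs only O(log j) productions in this scheme.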



Phase 2 (Lines 6-12): Since all repetitions have already been removed, every
Σ-factor has length at least two, and boundary and locally maximum pairs do
not overlap each other. Obviously, the algorithm will find such an appropriate set
R of replacements. Then, according to R, the algorithm replaces w and adds all
productions in R to the current dictionary D_m. Note that any symbol in w will
be replaced by an operation in either line 8 or line 11: this trick plays an
important role in reducing the space complexity.



Phase 3 (Lines 14-15): The algorithm repeats the above steps until all pairs
in the current string are mutually different, and then outputs the final
dictionary.

Since the algorithm replaces either w[i, i+1] or w[i+1, i+2], or both,
the outer loop in LCA(w) repeats at most O(log |w|) times. Moreover, after each
iteration of the outer loop, the length of the string is at most 2/3 of the
previous one. We can verify whether a segment is locally maximum by its lca in
O(1) time¹. Thus, LCA(w) runs in time linear in |w|.
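
The step from the shrinking string length to the linear bound can be spelled out as a geometric series. The following is a sketch under two assumptions read from the paragraph above: each iteration costs at most c times the current string length for some constant c, and each iteration leaves a string at most 2/3 as long as before.

```latex
\[
  T(w) \;\le\; \sum_{t \ge 0} c \left(\frac{2}{3}\right)^{t} |w|
       \;=\; \frac{c\,|w|}{1 - 2/3}
       \;=\; 3c\,|w|
       \;=\; O(|w|).
\]
```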

Theorem 2. The running time of LCA(w) is bounded by O(|w|).

4.2 Performance Analysis 

Lemma 1. Let w be a string in Σ*, and let [l, r] and [l', r'] be intervals of w of
the same substring α = w[l, r] = w[l', r']. Let R be the appropriate replacement
for w specified by the dictionary D produced by LCA(w). Then, for each index i
with ⌈log |Σ|⌉ + 1 ≤ i ≤ |α| − ⌈log |Σ|⌉, a replacement "[l+i−1, l+i] → a" with
some a ∈ N is in R if and only if "[l'+i−1, l'+i] → a" is in R.

Proof. All locally maximum and boundary pairs are independent of each other.
Thus, R contains all locally maximum and boundary pairs in w. Every segment
between those pairs is replaced by a nonterminal so as to maintain consistency
with the up-to-date dictionary, by the left-to-right scheme. A prefix and a
suffix of α having no locally maximum pair are no longer than ⌈log |Σ|⌉. Thus,
any pair of the ith segments in w[l, r] and w[l', r'], except the first
⌈log |Σ|⌉ segments and the last ⌈log |Σ|⌉ ones, is going to be replaced by the
same nonterminal, or is not going to be replaced. □

Theorem 3. The worst-case approximation ratio of the size of a grammar produced
by the algorithm LCA to the size of a minimum grammar is O(log g* log n),
where g* is the size of a minimum grammar.

Proof. We first estimate the number of different nonterminals produced by an
appropriate replacement R for an input string w ∈ Σ*. Let g* be the size of
a minimum grammar for w, and let w_1 ⋯ w_m be the LZ-factorization of w.
We denote by #(w)_R the number of different nonterminals produced by R.

From the definition of LZ-factorization, any factor w_i occurs in w_1 ⋯ w_{i−1}
or |w_i| = 1. With Lemma 1, any factor w_i and its left-most occurrence are
compressed into almost the same strings αβγ and α'βγ' such that |αγ| and |α'γ'|
are O(log |Σ|). Thus, we can estimate
#(w)_R = #(w_1 ⋯ w_{m−1})_R + O(log |Σ|) = O(m log |Σ|) = O(g* log |Σ|),
where the last step uses m ≤ g* from Theorem 1. Therefore, we can apply the
above estimation for the occurrences of β whenever |β| ≥ 2. Since Σ is initially
a constant alphabet, #(w)_R converges to O(g* log g*). Hence, O(g* log g*) is the
maximum number of different nonterminals produced by a set of appropriate
replacements by LCA(w). The main loop of LCA(w) is executed at most O(log n)
times. Therefore, the total number of different nonterminals produced by LCA(w)
is O(g* log g* log n). This derives the approximation ratio.

The memory space required by LCA(w) can be bounded by the size of the data
structure that answers the membership query: the input is a pair A_i A_j; the
output is an integer k if A_k → A_i A_j is already created, and "no" otherwise.
By Theorem 3, the size of a current dictionary D_m is bounded by O(g* log g*)
for each m ≥ 0. Moreover, each symbol A_i in a current string is replaced by a
rule of the form A_j → A_i or A_j → YZ, where A_i ∈ {Y, Z}. Thus, an
O((g* log g*)^2)-space algorithm is obtained by a naive implementation. Finally
we show that the memory space can be improved to O(g* log g*).

¹ We can get the lca of any two leaves i and j of complete binary trees by an xor
operation between binary numbers in O(1) time under our RAM model [5].

4.3 Improving the Space Efficiency 

An idea for improving the space complexity of the algorithm is to recycle
nonterminals created in the preceding iteration. Let D(α) be the string obtained
by applying a dictionary D to a string α. Let D_1 and D_2 be dictionaries such
that any symbol in w is replaced by D_1 and any symbol in D_1(w) is replaced by
D_2. Then, the decoding of the string D_2(D_1(w)) is uniquely determined, even if
D_2 reuses nonterminals in D_1 like "A → AB." Thus, we consider that the final
dictionary D is composed of D_1, ..., D_m, where D_i is the dictionary constructed
in the ith iteration. Since any symbol is replaced with a nonterminal in line
8 or 11 in the algorithm, the decoding is unique and D_m(⋯ D_1(w') ⋯) = w for
the final string w'. Such a dictionary is computed by the following function and
data structures.

Let D_i be the set of productions, N_i the set of alphabet symbols created in
the ith iteration, and k_i the cardinality |N_i|. We define the function
f_i(x, y) = (x − 1)k_i + y for 1 ≤ x, y ≤ k_i. This is a one-to-one mapping from
{1, ..., k_i} × {1, ..., k_i} to {1, ..., k_i^2}, and is used to decide an index
of a new nonterminal associated to a pair A_x A_y, where A_x denotes the xth
created nonterminal in N_i.

The next dictionary D_{i+1} is constructed from D_i, N_i, and f_i as follows. In
the algorithm LCA, there are two cases of replacements: one is for replacements
of pairs, and the other is for replacements of individual symbols in line 11. We
first explain the case of replacements of pairs. Let a pair A_x A_y in a current
string be decided to be replaced. The algorithm LCA computes the integer
z = f_i(x, y), and looks up a hash table H for z. If H(z) is absent and
N_{i+1} = {A_1, ..., A_k}, then set N_{i+1} = N_{i+1} ∪ {A_{k+1}},
D_{i+1} = D_{i+1} ∪ {A_{k+1} → A_x A_y}, H(z) = k + 1, and replace the pair
A_x A_y with A_{k+1}. If H(z) = k + 1 is present, then only replace the pair
A_x A_y by A_{k+1}. We next explain the case of replacements of individual
symbols. Since all maximal repetitions like A^+ are removed in lines 4-5, there
is no pair like AA in a current string. Thus, for a replacement of a single
symbol A_x, we can use the nonterminal A_{k+1} such that z = f_i(x, x) and
H(z) = k + 1. The dictionary D_i constructed in the ith iteration can be divided
into D_i^1 and D_i^2 such that D_i^1 is the dictionary for repetitions and
D_i^2 = D_i \ D_i^1. Thus, we can create all productions without collisions, and
decode a current string w_{i+1} to the previous string w_i in the manner
D_i(w_{i+1}) = D_i^1(D_i^2(w_{i+1})) = w_i.
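
The index function and the hash-table look-up described above can be sketched in a few lines. This is a minimal illustration, assuming nonterminals are identified by their 1-based indices and that a Python dictionary plays the role of the hash table H; the class `PairDictionary` and its method names are hypothetical.

```python
def pair_index(x, y, k):
    """f(x, y) = (x - 1) * k + y, one-to-one from {1..k} x {1..k} to {1..k*k}."""
    return (x - 1) * k + y


class PairDictionary:
    """Maps pairs of current nonterminal indices to new nonterminal indices."""

    def __init__(self, k):
        self.k = k            # number of nonterminals in the current alphabet
        self.table = {}       # hash table H: z -> index of the new nonterminal
        self.rules = []       # rules[n - 1] = (x, y) means A_n -> A_x A_y

    def lookup_or_create(self, x, y):
        z = pair_index(x, y, self.k)
        if z not in self.table:          # H(z) absent: create A_{k+1}
            self.rules.append((x, y))
            self.table[z] = len(self.rules)
        return self.table[z]


d = PairDictionary(3)
first = d.lookup_or_create(1, 2)
second = d.lookup_or_create(2, 1)
again = d.lookup_or_create(1, 2)
```

A single symbol A_x would use the otherwise unused key f(x, x), which is free because no pair of equal symbols survives the removal of repetitions.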

Theorem 4. The space required by LCA(w) is O(g* log g*).

Proof. Let n be the size |w| and m the number of iterations of the outer loop
executed in LCA(w). By Theorem 3, the number |N_i| of new nonterminals created
in the ith iteration is O(g* log g*) for each i ≤ m. To decide the index of a
new nonterminal from a pair A_x A_y, LCA computes z = f_i(x, y), H(z), and
k = |N_i| for the current N_i. Since |z| ≤ O(log n) and the number of different z
is O(g* log g*), the space for H is O(g* log g*) and k = O(g* log g*). Thus, the
construction of D_i requires only O(g* log g*) space. We can release the whole
memory space for D_i in the next loop. Hence, the total size of the space for
constructing D is also O(g* log g*).

Table 1. Results for the Canterbury corpus.

File          Category           Size     Repeat  Size of  Size of  Compressed    gzip
                                 (bytes)  times   w        D        size (bytes)  (bytes)
alice29.txt   English text        152089   6        5053     45243   176956         54435
asyoulik.txt  Shakespeare         125179   7        2523     42562   156220         48951
cp.html       HTML source          24603   7         470      9010    28980          7999
fields.c      C source             11150   9          71      4418    12616          3143
grammar.lsp   LISP source           3721   6         122      1730     4408          1246
kennedy.xls   Excel Spreadsheet  1029744   5       41600    139118   980832        206779
lcet10.txt    Technical writing   426754   7        8167    113760   477268        144885
plrabn12.txt  Poetry              481861   7        9779    138353   593988        195208
ptt5          CCITT test set      513216   6        2759     40784   154836         56443
sum           SPARC Executable     38240  11          77     14349    46260         12924
xargs.1       GNU manual page       4227   5         262      2122     5804          1756



5 Experiments 

To estimate the performance of LCA(w), we implemented the algorithm and 
tested it. We used a PC with Intel Xeon 3.06GHz dual-CPU and 3.5GB memory 
running Cygwin on Windows XP, and used Gcc version 3.3.1 for the implemen- 
tation. We used the canterbury corpus and the artificial corpus, which are from 
the Canterbury Corpus (http://corpus.canterbury.ac.nz/). 

Tables 1 and 2 show the results for each corpus. In the tables, 'Repeat
times' means how many times the algorithm processed lines 3 to 11 of
Fig. 1. Note that it corresponds to the height of the syntax tree of the grammar
generated by the algorithm. 'Size of w' and 'Size of D' indicate, respectively,
the length of the sequence w and the total number of rules in the dictionary D
that are obtained at the end. 'Compressed size' indicates the whole output
file size, where w and D are encoded in a simple way: each element in them
is represented by an integer with the smallest number of bits that distinguishes
it from the others.

As we see from Table 1, LCA(w) gives worse compression ratios than gzip.
One reason is our simple encoding. Another main reason is that D becomes very
large when a target text is long and has few repetitions. If we thin out useless
rules in D, as the Sequitur algorithm [15] does, and apply more efficient
encodings, the ratio can be improved. On the other hand, Table 2 shows that
there are cases in which LCA(w) is better than gzip, because the texts have
many repetitions.
























Table 2. Results for the artificial corpus.

File          Category                                Size     Repeat  Size of  Size of  Compressed    gzip
                                                      (bytes)  times   w        D        size (bytes)  (bytes)
a.txt         The letter 'a'                               1   1           1        1        16            27
aaa.txt       The letter 'a', repeated 100,000 times  100000   1           1       22        64           141
alphabet.txt  Enough repetitions of the alphabet      100000   6           3       77       136           315
              to fill 100,000 characters
random.txt    100,000 characters, randomly selected   100000   3       17696    42500    191672         75689
              from [a-z|A-Z|0-9|!| ] (alphabet
              size 64)


Table 3. Maximum memory consumption for the dictionary.

text size (bytes)   Sequitur (bytes)   LCA(w) (bytes)
          10                132              2056
         100               1344              2056
        1000              14352              6740
       10000              98796             46800
      100000             726828            428480
     1000000            6903468           2880300


Table 4. Compression time.

text size (bytes)   Sequitur (s)   LCA(w) (s)
          10            0.030         0.093
         100            0.030         0.077
        1000            0.046         0.061
       10000            0.061         0.077
      100000            0.390         0.186
     1000000            4.311         0.874


Our algorithm runs in O(n) time and the size of the dictionary is bounded by
O(g* log g*) on average. To estimate time and space efficiency, we compared
LCA(w) with Sequitur (Tables 3 and 4). We used random texts with |Σ| = 26
as target texts. Since the memory consumption of both algorithms increases and
decreases while running, we measured the maximum memory consumption. The
results show that LCA(w) is superior to Sequitur when the text is sufficiently
long.



6 Conclusion 

We presented a linear-time algorithm for grammar-based compression. This
algorithm guarantees the approximation ratio O(log g* log n) and the memory
space O(g* log g*). This space bound is considered to be sufficiently small since
Ω(g*) space is a lower bound for non-adaptive dictionary-based compression. In
particular, the upper bound on memory space is the best among the previously
known linear-time polylog-approximation algorithms. We also showed the
scalability of our algorithm for large text data.

References 

1. G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and 
M. Protasi. Complexity and Approximation: Combinatorial Optimization Problems 
and Their Approximability Properties. Springer, 1999. 

2. S. De Agostino and J. A. Storer. On-Line versus Off-Line Computation in Dynamic 
Text Compression. Inform. Process. Lett., 59:169-174, 1996. 

3. M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Rasala, A. Sa- 
hai, and A. Shelat. Approximating the Smallest Grammar: Kolmogorov Complex- 
ity in Natural Models. In Proc. 29th Ann. Sympo. on Theory of Computing, 792- 
801, 2002. 

4. M. Farach. Optimal Suffix Tree Construction with Large Alphabets. In Proc. 38th 
Ann. Sympo. on Foundations of Computer Science, 137-143, 1997. 

5. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and 
Computational Biology. Cambridge University Press, 1997. 

6. T. Kida, Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Collage System: 
a Unifying Framework for Compressed Pattern Matching. Theoret. Comput. Sci. 
(to appear). 

7. J. C. Kieffer and E.-H. Yang. Grammar-Based Codes: a New Class of Universal
Lossless Source Codes. IEEE Trans. on Inform. Theory, 46(3):737-754, 2000.

8. J. C. Kieffer, E.-H. Yang, G. Nelson, and P. Cosman. Universal Lossless Com- 
pression via Multilevel Pattern Matching. IEEE Trans. Inform. Theory, IT-46(4), 
1227-1245, 2000. 

9. D. Knuth. Seminumerical Algorithms. Addison-Wesley, 441-462, 1981.

10. N. J. Larsson and A. Moffat. Offline Dictionary-Based Compression. Proceedings 
of the IEEE, 88(11):1722-1732, 2000. 

11. E. Lehman. Approximation Algorithms for Grammar-Based Compression. PhD 
thesis, MIT, 2002. 

12. E. Lehman and A. Shelat. Approximation Algorithms for Grammar-Based
Compression. In Proc. 20th Ann. ACM-SIAM Sympo. on Discrete Algorithms, 205-212,
2002.

13. M. Lothaire. Combinatorics on Words, volume 17 of Encyclopedia of Mathematics
and Its Applications. Addison-Wesley, 1983.

14. C. Nevill-Manning and I. Witten. Compression and Explanation Using Hierarchical 
Grammars. Computer Journal, 40(2/3):103-116, 1997. 

15. C. Nevill-Manning and I. Witten. Identifying hierarchical structure in sequences: 
a linear-time algorithm. J. Artificial Intelligence Research, 7:67-82, 1997. 

16. W. Rytter. Application of Lempel-Ziv Factorization to the Approximation of 
Grammar-Based Compression. In Proc. 13th Ann. Sympo. Combinatorial Pattern 
Matching, 20-31, 2002. 

17. H. Sakamoto. A Fully Linear-Time Approximation Algorithm for Grammar-Based 
Compression. Journal of Discrete Algorithms, to appear. 







18. D. Salomon. Data compression: the complete reference. Springer, second edition, 
1998. 

19. J. Storer and T. Szymanski. Data compression via textual substitution. J. Assoc.
Comput. Mach., 29(4):928-951, 1982.

20. J. A. Storer and T. G. Szymanski. The Macro Model for Data Compression. In 
Proc. 10th Ann. Sympo. on Theory of Computing, pages 30-39, San Diego, Cali- 
fornia, 1978. ACM Press. 

21. T. A. Welch. A Technique for High Performance Data Compression. IEEE
Comput., 17:8-19, 1984.

22. E.-H. Yang and J. C. Kieffer. Efficient Universal Lossless Data Compression
Algorithms Based on a Greedy Sequential Grammar Transform-Part One: without
Context Models. IEEE Trans. on Inform. Theory, 46(3):755-777, 2000.

23. J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression.
IEEE Trans. on Inform. Theory, IT-23(3):337-349, 1977.

24. J. Ziv and A. Lempel. Compression of Individual Sequences via Variable-Rate
Coding. IEEE Trans. on Inform. Theory, 24(5):530-536, 1978.




Simple, Fast, and Efficient 
Natural Language Adaptive Compression* 



Nieves R. Brisaboa¹, Antonio Farina¹, Gonzalo Navarro², and Jose R. Parama¹

¹ Database Lab., Univ. da Coruna, Facultade de Informatica
Campus de Elvina s/n, 15071 A Coruna, Spain
{brisaboa,fari,parama}@udc.es

² Center for Web Research, Dept. of Computer Science, Univ. de Chile
Blanco Encalada 2120, Santiago, Chile
gnavarro@dcc.uchile.cl



Abstract. One of the most successful natural language compression 
methods is word-based Huffman. However, such a two-pass semi-static 
compressor is not well suited to many interesting real-time transmis- 
sion scenarios. A one-pass adaptive variant of Huffman exists, but it 
is character-oriented and rather complex. In this paper we implement 
word-based adaptive Huffman compression, showing that it obtains very 
competitive compression ratios. Then, we show how End-Tagged Dense 
Code, an alternative to word-based Huffman, can be turned into a faster 
and much simpler adaptive compression method which obtains almost 
the same compression ratios. 



1 Introduction 

Transmission of compressed data is usually composed of four processes: com- 
pression, transmission, reception, and decompression. The first two are carried 
out by a sender process and the last two by a receiver. This abstracts from com- 
munication over a network, but also from writing a compressed file to disk so 
as to load and decompress it later. In some scenarios, especially the latter, com- 
pression and transmission usually complete before reception and decompression 
start. 

There are several interesting real-time transmission scenarios, however, where
those processes should take place concurrently. That is, the sender should be able
to start the transmission of compressed data without preprocessing the whole
text, and simultaneously the receiver should start reception and decompress the
text as it arrives. Real-time transmission is usually of interest when
communicating over a network. This kind of compression can be applied, for
example, to interactive services such as remote login or talk/chat protocols,
where small messages are exchanged during the whole communication time. It might
also be relevant to transmission of Web pages, so that the exchange of
(relatively small) pages between a server and a client over time enables adaptive
compression by installing a browser plug-in to handle decompression. This might
also be interesting for wireless communication with hand-held devices with little
bandwidth and processing power.

* [...] in part (for the Spanish group) by MCyT (PGE and FEDER) grant (TIC2003-06593)
[...] (P01-029-F), Mideplan, Chile.

Real-time transmission is handled with so-called dynamic or adaptive com- 
pression techniques. These perform a single pass over the text (so they are also 
called one-pass) and begin compression and transmission as they read the data. 
Currently, the most widely used adaptive compression techniques belong to the 
Ziv-Lempel family [1]. When applied to natural language text, however, the 
compression ratios achieved by Ziv-Lempel are not that good (around 40%). 

Statistical two-pass techniques, on the other hand, use a semi-static model. A 
first pass over the text to compress gathers global statistical information, which 
is used to compress the text in a second pass. The computed model is transmitted 
prior to the compressed data, so that the receiver can use it for decompression. 
Classic Huffman code [11] is a well-known two-pass method. Its compression ratio 
is rather poor for natural language texts (around 60%). In recent years, however, 
new Huffman-based compression techniques for natural language have appeared, 
based on the idea of taking the words, not the characters, as the source symbols 
to be compressed [13]. Since in natural language texts the frequency distribution 
of words is much more biased than that of characters, the gain in compression is 
enormous, achieving compression ratios around 25%-30%. Additionally, since in 
Information Retrieval (IR) words are the atoms searched for, these compression 
schemes are well suited to IR tasks. Word-based Huffman variants focused on 
fast retrieval are presented in [7], where a byte- rather than bit-oriented coding 
alphabet speeds up decompression and search. 

Two-pass codes, unfortunately, are not suitable for real-time transmission. 
Hence, developing an adaptive compression technique with good compression ra- 
tios for natural language texts is a relevant problem. In [8, 9] a dynamic Huffman 
compression method was presented. This method was later improved in [12, 14]. 
In this case, the model is not previously computed nor transmitted, but rather 
computed and updated on the fly both by sender and receiver. 

However, those methods are character- rather than word-oriented, and thus 
their compression ratios on natural language are poor. Extending those algo- 
rithms to build a dynamic word-based Huffman method and evaluating its com- 
pression efficiency and processing cost is the first contribution of this paper. We 
show that the compression ratios achieved are in most cases just 0.06% over 
those of the semi-static version. The algorithm is also rather efficient: it
compresses 4 megabytes per second on our machine. On the other hand, it is rather
complex to implement.

Recently, a new word-based byte-oriented method called End-Tagged Dense 
Code (ETDC) was presented in [3]. ETDC is not based on Huffman at all. It 
is simpler and faster to build than Huffman codes, and its compression ratio 
is only 2%-4% over the corresponding word-based byte-oriented Huffman code.
For IR purposes, ETDC is especially interesting because it permits direct text
searching, much as the Tagged Huffman variants developed in [7]. However, 
ETDC compresses better than those fast-searchable Huffman variants. 

The second contribution of this paper is to show another advantage of ETDC 
compared to Huffman codes. We show that an adaptive version of ETDC is much 
simpler to program and 22%-26% faster than word-oriented dynamic Huffman 
codes. Moreover, its compression ratios are only 0.06% over those of semi-static 
ETDC, and 2%-4% over semi-static Huffman code. From a theoretical viewpoint, 
dynamic Huffman complexity is proportional to the number of target symbols
output, while dynamic ETDC complexity is proportional to the number of source
symbols processed. The latter is never larger than the former, and the difference
increases as more compression is obtained.

As a sanity check, we also present empirical results comparing our dynamic
word-based codes against two well-known compressors: gzip (fast compression
and decompression, but poor compression ratio) and bzip2 (good compression
ratio, but slower). These results show that our two techniques provide a
well-balanced trade-off between compression ratio and speed.

2 Word-Based Semi-static Codes 

Since in this paper we focus on word-based natural language text compression,
we use the terms source symbol and word interchangeably, and sometimes call
the set of source symbols the vocabulary.

2.1 Word-Based Huffman Codes 

The idea of Huffman coding [11] is to compress the text by assigning shorter 
codes to more frequent symbols. Huffman's algorithm obtains an optimal (shortest
total length) prefix code for a given text. A code is a prefix code if no codeword is 
a prefix of any other codeword. A prefix code can be decoded without reference 
to future codewords, since the end of a codeword is immediately recognizable. 

The word-based byte-oriented Huffman codes proposed in [7] obtain compression
ratios on natural language close to 30% by coding with bytes instead of bits
(the bit-oriented approach achieves ratios close to 25%). In exchange,
decompression and searching are much faster with byte-oriented Huffman code
because no bit manipulations are necessary. This word-based byte-oriented
Huffman code will be called Plain Huffman code in this paper.

Another code proposed in [7] is Tagged Huffman code. This is like Plain 
Huffman, except that the first bit of each byte is reserved to flag whether the 
byte is the first of its codeword. Hence, only 7 bits of each byte are used for the 
Huffman code. Note that the use of a Huffman code over the remaining 7 bits is
mandatory, as the flag is not useful by itself to make the code a prefix code. The
tag bit permits direct searching on the compressed text by simply compressing
the pattern and then running any classical string matching algorithm. On Plain
Huffman this does not work, as the pattern could occur in the text not aligned
to any codeword [7].

While searching Plain Huffman compressed text requires inspecting all its 
bytes from the beginning, Boyer-Moore type searching (that is, skipping bytes) 
[2] is possible over Tagged Huffman code. On the other hand, Tagged Huffman
code pays a price in terms of compression performance of approximately 11%, 
as it stores full bytes but uses only 7 bits for coding. 

2.2 End-Tagged Dense Codes

End-Tagged Dense code (ETDC) [3] is obtained by a seemingly dull change to 
Tagged Huffman code. Instead of using a flag bit to signal the beginning of a 
codeword, the end of a codeword is signaled. That is, the highest bit of any 
codeword byte is 0 except for the last byte, where it is 1. By this change there 
is no need at all to use Huffman coding in order to maintain a prefix code. 

In general, ETDC can be defined over symbols of $b$ bits, although in this
paper we focus on the byte-oriented version where $b = 8$. ETDC is formally
defined as follows.

Definition 1 Given source symbols $\{s_1, \ldots, s_n\}$, End-Tagged Dense Code
assigns number $i - 1$ to the $i$-th most frequent symbol. This number is
represented in base $2^{b-1}$, as a sequence of digits, from most to least
significant. Each such digit is represented using $b$ bits. The exception is the
least significant digit $d_0$, where we represent $2^{b-1} + d_0$ instead of just $d_0$.

That is, the first word is encoded as 10000000, the second as 10000001, until
the 128th as 11111111. The 129th word is coded as 00000000:10000000, the 130th as
00000000:10000001, and so on until the $(128^2 + 128)$th word, 01111111:11111111,
just as if we had a 14-bit number.

As it can be seen, the computation of codes is extremely simple: It is only nec- 
essary to sort the source symbols by decreasing frequency and then sequentially 
assign the codewords. The coding phase is faster than using Huffman because 
obtaining the codes is simpler. Empirical results comparing ETDC against Plain 
and Tagged Huffman can be found in [3] . 

Note that the code depends on the rank of the words, not on their actual 
frequency. As a result, it is not even necessary to transmit the code of each word, 
but just the sorted vocabulary, as the model to the receiver. Hence, End-Tagged 
Dense Codes are simpler, faster, and compress better than Tagged Huffman 
codes. Since the last bytes of codewords are distinguished, they also permit 
direct search of the compressed text for the compressed pattern, using any search 
algorithm. 
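
The direct-search property mentioned above can be illustrated with a minimal Python sketch. It assumes the byte-oriented case ($b = 8$) and uses a hypothetical helper name, `find_codeword`; the check on the preceding byte is our own illustration of why a distinguished last byte suffices to rule out misaligned matches, not code taken from [3].

```python
def find_codeword(compressed: bytes, pattern: bytes) -> list:
    """Return the offsets where `pattern` (one whole ETDC codeword)
    occurs aligned to a codeword boundary in `compressed`.

    A candidate match is aligned when it starts the stream or when the
    byte just before it has its highest bit set, i.e. that byte ends
    the previous codeword.
    """
    hits = []
    pos = compressed.find(pattern)          # any plain byte-string search works
    while pos != -1:
        if pos == 0 or compressed[pos - 1] >= 128:
            hits.append(pos)
        pos = compressed.find(pattern, pos + 1)
    return hits


# Three codewords back to back: [80], [00 80], [81].
compressed = bytes([0x80, 0x00, 0x80, 0x81])

# The one-byte codeword 80 also appears as the tail of 00 80;
# that occurrence is rejected because the byte before it is 00.
print(find_codeword(compressed, b"\x80"))
print(find_codeword(compressed, b"\x00\x80"))
```

Here `bytes.find` merely stands in for "any search algorithm"; a Boyer-Moore style matcher could be substituted without changing the alignment test.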

On-the-Fly Coding and Decoding. We finally observe that, for compression and
decompression, we do not really have to start by sequentially assigning the codes
to the sorted words. An on-the-fly encoding is also possible.

Given a word ranked $i$ in the sorted vocabulary, the encoder can run a simple
encode function to compute the codeword $C_i = encode(i)$. It is a matter of
expressing $i - 1$ in base $2^{b-1}$ (which requires just bit shifts and masking)
and outputting the sequence of digits. Function encode takes just $O(l)$ time,
where $l = O(\log(i)/b)$ is the length in digits of codeword $C_i$.

At decompression time, given codeword $C_i$ of $l$ digits and the sorted
vocabulary, it is also possible to compute, in $O(l)$ time, function
$i = decode(C_i)$, essentially by interpreting $C_i$ as a base $2^{b-1}$ number
and finally adding 1. Then, we retrieve the $i$-th word in the sorted vocabulary.

Sender ()
    Vocabulary ← {C_new-Symbol};
    Initialize CodeBook;
    for i ∈ 1 ... n do
        read s_i from the text;
        if s_i ∉ Vocabulary then
            send C_new-Symbol;
            send s_i in plain form;
            Vocabulary ← Vocabulary ∪ {s_i};
            f(s_i) ← 1;
        else
            send CodeBook(s_i);
            f(s_i) ← f(s_i) + 1;
        Update CodeBook;

Receiver ()
    Vocabulary ← {C_new-Symbol};
    Initialize CodeBook;
    for i ∈ 1 ... n do
        receive C_i;
        if C_i = C_new-Symbol then
            receive s_i in plain form;
            Vocabulary ← Vocabulary ∪ {s_i};
            f(s_i) ← 1;
        else
            s_i ← CodeBook^{-1}(C_i);
            f(s_i) ← f(s_i) + 1;
        output s_i;
        Update CodeBook;

Fig. 1. Sender and receiver processes in statistical dynamic text compression.
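
The on-the-fly encode and decode functions of Section 2.2 can be sketched in Python as follows, for the byte-oriented case ($b = 8$). This is an illustrative sketch, not the authors' implementation. One detail is an assumption on our part: the per-digit offset of 1 used below is what makes the two-byte codewords start at 00000000:10000000 for the 129th word, as in the example given after Definition 1, instead of literally writing $i - 1$ in base 128.

```python
def encode(i: int) -> bytes:
    """Codeword of the word ranked i (1-based) in the sorted vocabulary."""
    x = i - 1
    digits = [128 + x % 128]        # last byte: highest bit set (end tag)
    x = x // 128 - 1
    while x >= 0:
        digits.append(x % 128)      # earlier bytes: highest bit clear
        x = x // 128 - 1
    return bytes(reversed(digits))


def decode(codeword: bytes) -> int:
    """Rank (1-based) of the word whose codeword is `codeword`."""
    x = 0
    for byte in codeword[:-1]:      # all bytes except the tagged last one
        x = x * 128 + byte + 1
    return x * 128 + (codeword[-1] - 128) + 1


for rank in (1, 128, 129, 130, 128**2 + 128):
    print(rank, encode(rank).hex(), decode(encode(rank)))
```

Both functions run in time proportional to the codeword length, and neither needs a stored table of codewords; only the sorted vocabulary is required to map ranks to words.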

3 Statistical Dynamic Codes 

Statistical dynamic compression techniques are one-pass. Statistics are collected 
as the text is read, and consequently, the model is updated as compression 
progresses. They do not transmit the model, as the receiver can figure out the 
model by itself from the received codes. 

In particular, zero-order compressors model the text using only the information
on source symbol frequencies, that is, $f(s_i)$ is the number of times source
symbol $s_i$ appears in the text (read up to now). In the discussion that follows
we focus on zero-order compressors.

In order to maintain the model up to date, dynamic techniques need a data 
structure to keep the vocabulary of all symbols $s_i$ and their frequencies $f(s_i)$
up to now. This data structure is used by the encoding/decoding scheme, and 
is continuously updated during compression/decompression. After each change 
in the vocabulary or frequencies, the codewords assigned to all source symbols 
may have to be recomputed due to the frequency changes. This recomputation 
must be done both by the sender and the receiver. 

Figure 1 depicts the sender and receiver processes, highlighting the symmetry 
of the scheme. CodeBook stands for the model, used to assign codes to source 
symbols or vice versa. 
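
The symmetry of Figure 1 can be sketched in Python with a deliberately naive CodeBook that re-sorts the vocabulary by frequency before every symbol. All names here (`rank_codebook`, `sender`, `receiver`, the `"NEW"`/`"CODE"` tags) are hypothetical, and the messages are plain tuples instead of byte codewords; the sketch only shows that sender and receiver stay synchronized without the model ever being transmitted.

```python
def rank_codebook(freq):
    """Naive CodeBook: symbols sorted by decreasing frequency.
    The sort is stable, so ties keep first-occurrence order on both sides."""
    return sorted(freq, key=lambda s: -freq[s])


def sender(words):
    freq = {}
    for s in words:
        book = rank_codebook(freq)              # model before seeing s
        if s not in freq:
            yield ("NEW", s)                    # C_new-Symbol, then s in plain form
            freq[s] = 1
        else:
            yield ("CODE", book.index(s) + 1)   # 1-based rank of s
            freq[s] += 1


def receiver(messages):
    freq = {}
    out = []
    for kind, payload in messages:
        book = rank_codebook(freq)              # same model as the sender had
        if kind == "NEW":
            s = payload
            freq[s] = 1
        else:
            s = book[payload - 1]
            freq[s] += 1
        out.append(s)
    return out


text = "the rose rose is beautiful beautiful".split()
messages = list(sender(text))
print(messages)
print(receiver(messages) == text)
```

Re-sorting on every symbol is far too slow for real use; the point of Sections 3.1 to 4 is precisely how to perform the "Update CodeBook" step cheaply.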

3.1 Dynamic Huffman Codes 

In [8, 9] an adaptive character-oriented Huffman coding algorithm was presented. 
It was later improved in [12] and named the FGK algorithm. FGK is the basis of
the UNIX compact command. 







FGK maintains a Huffman tree for the source text already read. The tree 
is adapted each time a symbol is read to keep it optimal. It is maintained both 
by the sender, to determine the code corresponding to a given source symbol, 
and by the receiver, to do the opposite. Thus, the Huffman tree acts as the 
CodeBook of Figure 1. Consequently, it is initialized with a unique special node 
called zeroNode (corresponding to new-Symbol), and it is updated every time a
new source symbol is inserted in the vocabulary or a frequency increases. The 
codeword for a source symbol corresponds to the path from the tree root to the 
leaf corresponding to that symbol. Any leaf insertion or frequency change may 
require reorganizing the tree to restore its optimality. 

The main challenge of Dynamic Huffman is how to reorganize the Huffman 
tree efficiently upon leaf insertions and frequency increments. This is a complex 
and potentially time-consuming process that must be carried out both by the 
sender and the receiver. 

The main achievement of FGK is to ensure that the tree can be updated by 
doing only a constant amount of work per node in the path from the affected leaf 
to the tree root. Calling $l(s_i)$ the path length from the leaf of source symbol
$s_i$ to the root, and $f(s_i)$ its frequency, the overall complexity of algorithm
FGK is $\sum_i f(s_i)\, l(s_i)$, which is exactly the length of the compressed text,
measured in number of target symbols.

3.2 Word-Based Dynamic Huffman Codes 

We implemented a word-based version of algorithm FGK. This is by itself a
contribution because no existing adaptive technique obtains a similar compression
ratio on natural language. As the number of text words is much larger than the
number of characters, several challenges arose in managing such a large
vocabulary. The original FGK algorithm pays little attention to these issues
because of its underlying assumption that the source alphabet is not very large.

However, the most important difference between our word-based version and 
the original FGK is that we chose the code to be byte rather than bit-oriented. 
Although this necessarily implies some loss in compression ratio, it gives a deci- 
sive advantage in efficiency. Recall that the algorithm complexity corresponds to 
the number of target symbols in the compressed text. A bit-oriented approach 
requires time proportional to the number of bits in the compressed text, while 
ours requires time proportional to the number of bytes. Hence byte-coding is 
almost 8 times faster. 

Being byte-oriented implies that each internal node can have up to 256 children
in the resulting Huffman tree, instead of 2 as in a binary tree. This required
extending the FGK algorithm in several respects.

4 Dynamic End-Tagged Dense Code 

In this section we show how ETDC can be made adaptive. Considering again
the general scheme of Figure 1, the main issue is how to maintain the CodeBook
up to date upon insertions of new source symbols and frequency increments.
In the case of ETDC, the CodeBook is essentially the array of source symbols
sorted by frequencies. If we are able to maintain such array upon insertions and
frequency changes, then we are able to code any source symbol or decode any
target symbol by using the on-the-fly encode and decode procedures explained
at the end of Section 2.2.

Fig. 2. Transmission of message "the rose rose is beautiful beautiful".



Figure 2 shows how the compressor operates. At first (step 0), no words have
been read, so new-Symbol is the only word in the vocabulary (it is implicitly
placed at position 1). In step 1, a new symbol "the" is read. Since it is not in
the vocabulary, $C_1$ (the codeword of new-Symbol) is sent, followed by "the" in
plain form (bytes 't', 'h', 'e' and some terminator '#'). Next, "the" is added
to the vocabulary (array) with frequency 1, at position 1. Implicitly, new-Symbol
has been displaced to array position 2. Step 2 shows the transmission of "rose",
which is not yet in the vocabulary. In step 3, "rose" is read again. As it was
in the vocabulary at array position 2, only codeword $C_2$ is sent. Now, "rose"
becomes more frequent than "the", so it moves upward in the array. Note that
a hypothetical new occurrence of "rose" would be transmitted as $C_1$, while
it was sent as $C_2$ in step 3. In steps 4 and 5, two more new words, "is" and
"beautiful", are transmitted and added to the vocabulary. Finally, in step 6,
"beautiful" is read again, and it becomes more frequent than "is" and "the".
Therefore, it moves upward in the vocabulary by means of an exchange with
"the".

The main challenge is how to efficiently maintain the sorted array. In the
sequel we show how we obtain a complexity equal to the number of source
symbols transmitted. This is always lower than FGK complexity, because at
least one target symbol must be transmitted for each source symbol, and usually
several more if some compression is going to be achieved. Essentially, we must
be able to identify groups of words with the same frequency in the array, and to
quickly promote a word to the next group when its frequency increases.

The data structures used by the sender and their functionality are shown
in Figure 3. The hash table of words keeps in word the source word characters,
in posInVoc the position of the word in the vocabulary array, and in freq its
frequency. In the vocabulary array (posInHT) the words are not explicitly
represented, but a pointer to word is stored. Finally, arrays top and last tell, for
each possible frequency, the vocabulary array positions of the first and last word
with that frequency. It always holds top[f - 1] = last[f] + 1 (so actually only one
array is maintained). If no words of frequency f exist, then last[f] = top[f] - 1.

Fig. 3. Transmission of words: ABABBCC, ABABBCCC and ABABBCCCD.



When the sender reads word $s_i$, it uses the hash function to obtain its
position p in the hash table, so that word[p] = $s_i$. After reading f = freq[p],
it increments freq[p]. The index of $s_i$ in the vocabulary array is also obtained
as i = posInVoc[p] (so it will send code $C_i$). Now, word $s_i$ must be promoted
to its next group. To this end, it finds the head of its group j = top[f] and
the corresponding word position h = posInHT[j], so as to swap words i and
j in the vocabulary array. The swapping requires exchanging posInHT[j] with
posInHT[i], setting posInVoc[p] = j and setting posInVoc[h] = i. Once the
swapping is done, we promote j to the next group by setting last[f + 1] = j and
top[f] = j + 1.

If $s_i$ turns out to be a new word, we set word[p] = $s_i$, freq[p] = 1, and
posInVoc[p] = n, where n is the number of source symbols known prior to
reading $s_i$ (and considering new-Symbol). Then exactly the above procedure is
followed with f = 0 and i = n. Also, n is incremented.

The receiver works very similarly, except that it starts from i and then it
obtains p = posInHT[i]. Figure 4 shows the pseudocode.
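
The update procedure just described can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' C code: a Python dict (`slot`) stands in for the hash table, positions are 0-based internally (ranks sent are position + 1), only the `top` array is kept (the text notes that `last` is redundant), and messages are `(rank, plain_word_or_None)` tuples where the rank would be turned into bytes by the on-the-fly encode function. The class and method names are hypothetical.

```python
from collections import defaultdict


class DynamicETDCModel:
    """Vocabulary kept sorted by decreasing frequency, O(1) update per word."""

    def __init__(self):
        self.slot = {}               # word -> slot p (stand-in for the hash table)
        self.word = []               # word[p]
        self.freq = []               # freq[p]
        self.pos_in_voc = []         # posInVoc[p]: position of slot p in the array
        self.pos_in_ht = []          # posInHT[i]: slot of the word at position i
        self.top = defaultdict(int)  # top[f]: first position of the frequency-f group

    def _insert(self, w):
        p = len(self.word)           # new words enter at the end of the array
        self.slot[w] = p
        self.word.append(w)
        self.freq.append(0)
        self.pos_in_voc.append(p)
        self.pos_in_ht.append(p)
        return p

    def _promote(self, p):
        f = self.freq[p]
        self.freq[p] = f + 1
        i = self.pos_in_voc[p]
        j = self.top[f]              # head of the group p currently belongs to
        h = self.pos_in_ht[j]
        self.pos_in_ht[i], self.pos_in_ht[j] = h, p   # swap positions i and j
        self.pos_in_voc[p] = j
        self.pos_in_voc[h] = i
        self.top[f] = j + 1          # group f now starts one position later

    def send(self, w):
        p = self.slot.get(w)
        if p is None:                # new word: rank of new-Symbol, then plain word
            msg = (len(self.word) + 1, w)
            p = self._insert(w)
        else:
            msg = (self.pos_in_voc[p] + 1, None)
        self._promote(p)
        return msg

    def receive(self, rank, plain):
        if rank == len(self.word) + 1:      # new-Symbol was sent
            p = self._insert(plain)
        else:
            p = self.pos_in_ht[rank - 1]
        self._promote(p)
        return self.word[p]

    def vocabulary(self):
        return [self.word[p] for p in self.pos_in_ht]


text = "the rose rose is beautiful beautiful".split()
tx, rx = DynamicETDCModel(), DynamicETDCModel()
messages = [tx.send(w) for w in text]
print(messages)
print([rx.receive(rank, plain) for rank, plain in messages] == text)
print(tx.vocabulary())
```

Running it on the message of Figure 2 reproduces the ranks discussed in the text: "rose" is sent with rank 2 in step 3 and then moves ahead of "the", and "beautiful" is exchanged with "the" in step 6.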

Implementing dynamic ETDC is simpler than building dynamic word-based 
Huffman. In fact, our implementation of the Huffman tree update takes about 
120 C source code lines, while the update procedure takes only about 20 lines in 
dynamic ETDC. 

Sender (s_i)
    p ← hash(s_i);
    if word[p] = nil then        // new word
        word[p] ← s_i;
        freq[p] ← 0;
        posInVoc[p] ← n;
        posInHT[n] ← p;
        n ← n + 1;
    f ← freq[p];
    freq[p] ← freq[p] + 1;
    i ← posInVoc[p];
    j ← top[f];
    h ← posInHT[j];
    posInHT[i] ↔ posInHT[j];
    posInVoc[p] ← j;
    posInVoc[h] ← i;
    last[f + 1] ← j;
    top[f] ← j + 1;

Receiver (i)
    p ← posInHT[i];
    if word[p] = nil then
        word[p] ← s_i;
        freq[p] ← 0;
        posInVoc[p] ← n;
        posInHT[n] ← p;
        n ← n + 1;
    f ← freq[p];
    freq[p] ← freq[p] + 1;
    i ← posInVoc[p];
    j ← top[f];
    h ← posInHT[j];
    posInHT[i] ↔ posInHT[j];
    posInVoc[p] ← j;
    posInVoc[h] ← i;
    last[f + 1] ← j;
    top[f] ← j + 1;

Fig. 4. Sender and receiver processes to update CodeBook in ETDC.

5 Empirical Results

We tested the different compressors over several texts. As representative of short
texts, we used the whole Calgary corpus. We also used some large text collections
from TREC-2 (AP Newswire 1988 and Ziff Data 1989-1990) and from TREC-4
(Congressional Record 1993, Financial Times 1991 to 1994). Finally, two larger
collections, ALL_FT and ALL, were used. ALL_FT aggregates all texts from the
Financial Times collection. The ALL collection is composed of the Calgary corpus
and all texts from TREC-2 and TREC-4.

Table 1. Compression ratios of dynamic versus semi-static techniques.

                                  Plain Huffman               End-Tagged Dense Code
CORPUS       TEXT SIZE     2-pass   dynamic  Increase     2-pass   dynamic  Increase    diff_ETDC
               (bytes)    ratio %   ratio %   diff_PH    ratio %   ratio %  diff_ETDC   - diff_PH
CALGARY      2,131,045     46.238    46.546     0.308     47.397    47.730     0.332       0.024
FT91        14,749,355     34.628    34.739     0.111     35.521    35.638     0.116       0.005
CR          51,085,545     31.057    31.102     0.046     31.941    31.985     0.045      -0.001
FT92       175,449,235     32.000    32.024     0.024     32.815    32.838     0.023      -0.001
ZIFF       185,220,215     32.876    32.895     0.019     33.770    33.787     0.017      -0.002
FT93       197,586,294     31.983    32.005     0.022     32.866    32.887     0.021      -0.001
FT94       203,783,923     31.937    31.959     0.022     32.825    32.845     0.020      -0.002
AP         250,714,271     32.272    32.294     0.021     33.087    33.106     0.018      -0.003
ALL_FT     591,568,807     31.696    31.710     0.014     32.527    32.537     0.011      -0.003
ALL      1,080,719,883     32.830    32.849     0.019     33.656    33.664     0.008      -0.011

A dual Intel® Pentium®-III 800 MHz system, with 768 MB SDRAM-100MHz
was used in our tests. It ran Debian GNU/Linux (kernel version 2.2.19). The
compiler used was gcc version 3.3.3 20040429 and -O9 compiler optimizations
were used. Time results measure CPU user-time. The spaceless word model [6]
was used to model the separators.

Table 1 compares the compression ratios of two-pass versus one-pass techniques.
Columns labeled diff measure the increase, in percentage points, in the
compression ratio of the dynamic codes compared against their semi-static
version. The last column shows the difference between those increases for ETDC
and Plain Huffman.

To understand the increase of size of dynamic versus semi-static codes, two
issues have to be considered: (i) each new word $s_i$ parsed during dynamic
compression is represented in the compressed text (or sent to the receiver) as a pair
$(C_{new-Symbol}, s_i)$, while in two-pass compression only the word $s_i$ needs to be
stored/transmitted in the vocabulary; (ii) on the other hand, some low-frequency
words can be encoded with shorter codewords by dynamic techniques, since by
the time they appear the vocabulary may still be small.

Compression ratios are around 30-35% for the larger texts. For the smaller
ones, compression is poor because the size of the vocabulary is proportionally
too large with respect to the compressed text size (as expected from Heaps' law
[10]). This means that proportionally too many words are transmitted in plain
form.

Table 2. Comparison between dynamic ETDC and dynamic PH.

                                                 Dyn PH               Dyn ETDC        Increase   Decrease
CORPUS       TEXT SIZE   (unlabeled)    time (sec)  ratio %   time (sec)  ratio %      size %     time %
               (bytes)
CALGARY      2,131,045        30,995         0.520   46.546        0.384   47.730       2.543     22.892
FT91        14,749,355        75,681         3.428   34.739        2.488   35.638       2.588     22.685
CR          51,085,545       117,713        11.450   31.102        8.418   31.985       2.839     22.629
FT92       175,449,235       284,892        41.330   32.024       31.440   32.838       2.542     26.404
ZIFF       185,220,215       237,622        44.628   32.895       33.394   33.787       2.710     22.559
FT93       197,586,294       291,427        47.118   32.005       36.306   32.887       2.755     20.840
FT94       203,783,923       295,018        48.260   31.959       36.718   32.845       2.774     22.006
AP         250,714,271       269,141        60.702   32.294       47.048   33.106       2.514     22.796
ALL_FT     591,568,807       577,352       143.050   31.710      111.068   32.537       2.609     23.796
ALL      1,080,719,883       886,190       268.983   32.849      213.068   33.664       2.481     25.927

Table 3. Comparison of compression ratio against gzip and bzip2.

                                            compression ratio %
CORPUS       TEXT SIZE     Dyn PH   Dyn ETDC   gzip -f   gzip -b   bzip2 -f   bzip2 -b
               (bytes)
CALGARY      2,131,045     46.546     47.730    43.530    36.840     32.827     28.924
FT91        14,749,355     34.739     35.638    42.566    36.330     32.305     27.060
CR          51,085,545     31.102     31.985    39.…      33.176     29.507     24.142
FT92       175,449,235     32.024     32.838    42.585    36.377     32.369     27.088
ZIFF       185,220,215     32.895     33.787    39.656    32.975     29.642     25.106
FT93       197,586,294     32.005     32.887    40.230    34.122     30.624     25.322
FT94       203,783,923     31.959     32.845    40.236    34.122     30.535     25.267
AP         250,714,271     32.294     33.106    43.651    37.225     33.260     27.219
ALL_FT     591,568,807     31.710     32.537    40.988    34.845     31.152     25.865
ALL      1,080,719,883     32.849     33.664    41.312    35.001     31.304     25.981

The increase of size of the compressed texts in ETDC compared to PH is 
always under 1 percentage point, in the larger texts. On the other hand, the dy- 
namic versions lose very little in compression (maximum 0.02 percentage points, 
0.06%) compared to their semi-static versions. This shows that the price paid 
by dynamism in terms of compression ratio is negligible. Note also that in most 
cases, and in the larger texts, dynamic ETDC loses even less compression than 
dynamic Plain Huffman. 

Table 2 compares the time performance of our dynamic compressors. The
last two columns measure the increase in compression ratio (in percentage) of
ETDC versus Plain Huffman, and the reduction in processing time, in
percentage.

As can be seen, dynamic ETDC loses less than 1 percentage point (3%)
of compression ratio compared to dynamic Plain Huffman, in the larger texts.
In exchange, it is 22%-26% faster and considerably simpler to implement.
Dynamic Plain Huffman compresses 4 megabytes per second, while dynamic ETDC
reaches 5.
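
As a worked check of these figures, the short Python sketch below recomputes throughput and the relative size increase from the ALL row of Table 2. The only assumption is that a megabyte is taken as $10^6$ bytes; the variable and function names are ours.

```python
# Figures copied from the ALL row of Table 2.
TEXT_SIZE_BYTES = 1_080_719_883
DYN_PH_TIME_S, DYN_PH_RATIO = 268.983, 32.849
DYN_ETDC_TIME_S, DYN_ETDC_RATIO = 213.068, 33.664


def throughput_mb_per_s(size_bytes, seconds):
    """Compression speed, assuming 1 megabyte = 10**6 bytes."""
    return size_bytes / seconds / 1e6


ph_speed = throughput_mb_per_s(TEXT_SIZE_BYTES, DYN_PH_TIME_S)
etdc_speed = throughput_mb_per_s(TEXT_SIZE_BYTES, DYN_ETDC_TIME_S)

# Gap between the two ratios in percentage points, and relative to PH.
gap_points = DYN_ETDC_RATIO - DYN_PH_RATIO
gap_relative = 100 * gap_points / DYN_PH_RATIO

print(f"Dyn PH:   {ph_speed:.2f} MB/s")
print(f"Dyn ETDC: {etdc_speed:.2f} MB/s")
print(f"size gap: {gap_points:.3f} points = {gap_relative:.3f}% of the PH ratio")
```

The distinction between percentage points (a difference of two ratios) and percent (that difference relative to the baseline ratio) is what separates the "less than 1 percentage point" and "3%" figures quoted above.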

Tables 3 and 4 compare both dynamic Plain Huffman and dynamic ETDC
against gzip (Ziv-Lempel family) and bzip2 (Burrows-Wheeler [5] type
technique). Experiments were run setting gzip and bzip2 parameters to "best" (-b)
and "fast" (-f) compression.

As expected, "bzip2 -b" achieves the best compression ratio. It is about 6 and
7 percentage points better than dynamic PH and dynamic ETDC respectively.
However, it is much slower than the other techniques tested in both compression
and decompression processes. Using the "fast" bzip2 option seems to be
undesirable, since compression ratio gets worse (it becomes closer to dynamic PH)
and compression and decompression speeds remain poor.

On the other hand, "gzip -f" is shown to achieve good compression speed, at
the expense of compression ratio (about 40%). Dynamic ETDC is also a fast
technique. It is able to beat "gzip -f" in compression speed (except in
the ALL corpus). Regarding compression ratio, dynamic ETDC also achieves
better results than "gzip -b" (except in the CALGARY and ZIFF corpora). However,
gzip is clearly the fastest method in decompression.

Hence, dynamic ETDC is either much faster or compresses much better than 
gzip, and it is by far faster than bzip2. 



Table 4. Comparison of compression and decompression time against gzip and bzip2.

                       compression time (sec)                             decompression time (sec)
CORPUS     Dyn PH  Dyn ETDC  gzip -f  bzip2 -f  bzip2 -b      Dyn PH  Dyn ETDC  gzip -f  bzip2 -f  bzip2 -b
CALGARY     0.498     0.384    0.360     2.180     2.660       0.330     0.240    0.090     0.775     0.830
FT91        3.218     2.488    2.720    14.380    18.200       2.350     1.545    0.900     4.655     5.890
CR         10.880     8.418    8.875    48.210    65.170       7.745     5.265    3.010    15.910    19.890
FT92       42.720    31.440   34.465   166.310   221.460      30.690    19.415    8.735    57.815    71.050
ZIFF       43.122    33.394   33.550   174.670   233.250      30.440    11.690    9.070    58.790    72.340
FT93       45.864    36.306   36.805   181.720   237.750      32.780    21.935   10.040    62.565    77.860
FT94       47.078    36.718   37.500   185.107   255.220      33.550    22.213   10.845    62.795    80.370
AP         60.940    47.048   50.330   231.785   310.620      43.660    27.233   15.990    81.875   103.010
ALL_FT    145.750    91.068  117.255   558.530   718.250     104.395    66.238   36.295   189.905   235.370
ALL       288.778   213.905  188.310   996.530  1342.430     218.745   126.938   62.485   328.240   432.390

(The column header row did not survive extraction; the column labels above are
inferred from Tables 2 and 3 and from the surrounding text.)

6 Conclusions 

In this paper we have considered the problem of providing adaptive compression 
for natural language text, with the combined aim of competitive compression 
ratios and good time performance. 

We built an adaptive version of word-based Huffman codes. To this end, we
adapted an existing algorithm so as to handle very large sets of source symbols
and byte-oriented output. The latter decision sacrifices some compression ratio
in exchange for an 8-fold improvement in time performance. The resulting
algorithm obtains a compression ratio very similar to that of its static version
(0.06% off) and compresses about 4 megabytes per second on a standard PC.

We also implemented a dynamic version of the End-Tagged Dense Code 
(ETDC). The resulting adaptive version is much simpler than the Huffman- 
based one, and 22%-26% faster, compressing typically 5 megabytes per second. 
The compressed text is only 0.06% larger than with semi-static ETDC and 3% 
larger than with Huffman. 

As a result, we have obtained adaptive natural language text compressors 
that obtain 30%-35% compression ratio, and compress more than 4 megabytes 
per second. Empirical results show their good performance when they are com- 
pared against other compressors such as gzip and bzip2. 

Future work involves building an adaptive version of (s, c)-Code [4], an ex- 
tension to ETDC where the number of byte values that signal the end of a 
codeword can be adapted to optimize compression, instead of being fixed at 128 
as in ETDC. An interesting problem in this case is how to efficiently maintain 
the optimal (s, c), which now vary as compression progresses.

Abstract. The issue of information retrieval in XML documents was first
investigated by the database community. Recently, the Information Retrieval (IR)
community started to investigate the XML search issue. For this purpose,
traditional information retrieval models were adapted to process XML documents
and rank results by relevance. In this paper, we describe an IR approach to deal
with queries composed of content and structure conditions. The XFIRM model
we propose is designed to be as flexible as possible to process such queries. It is
based on a complete query language, derived from XPath, and on a relevance
value propagation method. The value of this proposed method is evaluated
thanks to the INEX evaluation initiative. Results show a relatively high precision
for our system.



1 Introduction 

Users looking for precise information do not want to be overwhelmed by irrelevant 
material, as can happen with long documents. One of the main advantages of the XML 
format is its capacity to combine structured and unstructured (i.e. text) data. As a 
consequence, XML documents allow information to be processed at a finer granularity 
level than the whole document. The main challenge in XML retrieval is to retrieve the 
most exhaustive¹ and specific² information unit [12]. Approaches dealing with this 
challenge can be divided into two main sub-groups [5]. On the one hand, the 
data-oriented approaches use XML documents to exchange structured data. The database 
community was the first to propose solutions for the XML retrieval issue, using the 
data-oriented approaches. In the XQuery language proposed by the W3C [25], SQL 
functionalities on tables (collections of tuples) are extended to support similar 
operations on forests (collections of trees), as XML documents can be seen as trees. 
Unfortunately, most of the proposed approaches typically expect binary answers to very 
specific queries. However, an extension of XQuery with full-text search features is 
expected [26]. On the other hand, the document-oriented approaches consider that 
tags are used to describe the logical structure of documents. The IR community has 
adapted traditional IR approaches to address user information needs in XML 
collections. 



¹ An element is exhaustive to a query if it contains all the required information. 
² An element is specific to a query if all its content concerns the query. 

A. Apostolico and M. Melucci (Eds.): SPIRE 2004, LNCS 3246, pp. 242-254, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 
The goal of this paper is to show that the approach we propose, which belongs to 
the document-centric view, can also give good results for specific queries (regarding 
structure) containing content conditions. The following section gives a brief overview 
of related work. Then, in section 3, we present the XFIRM (XML Flexible Information 
Retrieval Model) model and the associated query language. Section 4 presents the 
INEX initiative for XML retrieval evaluation and evaluates our approach via 
experiments carried out on the INEX collection. 

2 Related Work: Information Retrieval Approaches 
for XML Retrieval 

One of the first IR approaches proposed for dealing with XML documents was the 
“fetch and browse” approach [3, 4], saying that a system should always retrieve the 
most specific part of a document answering a query. This definition assumes that the 
system first searches whole documents answering the query in an exhaustive way (the 
fetch phase) and then extracts the most specific information units (the browse phase). 
Most of the Information Retrieval Systems (IRS) dealing with XML documents allow 
information units to be directly searched, without first processing the whole docu- 
ments. Let us describe some of them. 

The extended boolean model uses a new non-commutative operator called “contains”, 
which allows queries to be specified completely in terms of content and structure 
[11]. 

Regarding the vector space model, the similarity measure is extended in order to 
evaluate relations between structure and content. In this case, each index term should 
be encapsulated by one or more elements. The model can be generalized with the 
aggregation of relevance scores in the document hierarchy [7]. In [22], the query 
model is based on tree matching: it allows the expression of queries without perfectly 
knowing the data structure. 

The probabilistic model is applied to XML documents in [12, 24, 5]. The XIRQL 
query language [5] extends the Xpath operators with operators for relevance-oriented 
search and vague searches on non-textual content. Documents are then sorted by 
decreasing probability that their content is the one specified by the user. 

Language models are also adapted for XML retrieval [1, 15]. Finally, Bayesian 
networks are used in [17]. 

In [9], Fuhr et al. proposed an augmentation method for dealing with XML 
documents. In this approach, standard term weighting formulas are used to index 
so-called “index nodes” of the document. Index nodes are not necessarily leaf nodes, 
because this structure is considered to be too fine-grained. However, index nodes are 
disjoint. In order to allow nesting of nodes, in case of high-level index nodes 
comprising other index nodes, only the text that is not contained within the other index 
nodes is indexed. For computing the indexing weights of inner nodes, the weights 
from the most specific index nodes are propagated towards the inner nodes. During 
propagation, however, the weights are down-weighted by multiplying them by a 
so-called augmentation factor. In case a term at an inner node receives propagated 
weights from several leaves, the overall term weight is computed by assuming a 
probabilistic disjunction of the leaf term weights. This way, more specific elements 
are preferred during retrieval. 

The approach we describe in this paper is also based on an augmentation method. 
However, in our approach, all leaf nodes are indexed, because we think that even the 
smallest leaf node can contain relevant information. Moreover, the way relevance 
values are propagated in the document tree is a function of the distance that 
separates nodes in the tree. The following section describes our model. 



3 The XFIRM Model 

3.1 Data Representation 

A structured document $sd_j$ is a tree, composed of simple nodes $n_{ij}$, leaf nodes 
$ln_{ij}$ and attributes $a_{ij}$. Formally, this can be written as follows: 
$sd_j = (tree_j) = (\{n_{ij}\}, \{ln_{ij}\}, \{a_{ij}\})$. 
This representation is a simplification of the XPath and XQuery data model [27], where a 
node can be a document, an element, text, a namespace, an instruction or a comment. 
In order to browse the document tree easily and to find ancestor-descendant 
relationships quickly, the XFIRM model uses the following representation of nodes and 
attributes, based on the XPath Accelerator approach [10]: 

Node: $n_{ij} = (pre, post, parent, attribute)$ 

Leaf node: $ln_{ij} = (pre, post, parent, \{t_1, t_2, \ldots, t_n\})$ 

Attribute: $a_{ij} = (pre, val)$ 

A node is defined by its pre-order and post-order values (pre and post), the 
pre-order value of its parent node (parent), and, depending on its type (simple node or 
leaf node), by a field indicating the presence or absence of attributes (attribute) or by 
the terms it contains ($\{t_1, t_2, \ldots, t_n\}$). An attribute is defined by the pre-order 
value of the node containing it (pre) and by its value (val). Pre-order and post-order 
values are assigned to nodes through a prefix and a postfix traversal of the 
document tree respectively, as illustrated in the following figure. 



<article> 
  <fm> 
    <title> Search engines : how to find a needle in a haystack</title> 
    <author> J. Dupont </author> 
    <year> 1998 </year> 
  </fm> 
  <bdy> 
    <sec> 
      <st> Introduction </st> 
      <p> Internet growth...</p> 
    </sec> 
    <sec> 
      <st> Search engines </st> 
      <p> Yahoo! is ...</p> 
      <p> Google is a full-text search engine </p> 
    </sec> 
  </bdy> 
</article> 



Fig. 1. Example of XML document 
Fig. 2. Tree representation of the XML document in Figure 1. Each node is assigned a 
pre-order and post-order value 



If we place nodes in a two-dimensional space based on their pre-order and post-order 
coordinates, we can exploit the following properties, given a node n: 

- all ancestors of n are to the upper left of n's position in the plane, 

- all its descendants are to the lower right, 

- all preceding nodes in document order are to the lower left, and 

- the upper right partition of the plane comprises all following nodes (regarding 
document order). 
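
The four properties above can be checked with a minimal sketch that numbers a tree by prefix and postfix traversal and then tests node relationships with two integer comparisons. The sketch uses only the element skeleton of the document in Fig. 1 (text content omitted) and assumes that numbering starts at 1; the class and function names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    pre: int = 0
    post: int = 0


def assign_ranks(root: Node) -> None:
    """Assign pre-order and post-order ranks, both starting at 1."""
    counter = {"pre": 0, "post": 0}

    def visit(node: Node) -> None:
        counter["pre"] += 1
        node.pre = counter["pre"]
        for child in node.children:
            visit(child)
        counter["post"] += 1
        node.post = counter["post"]

    visit(root)


def is_ancestor(a: Node, n: Node) -> bool:    # upper left of n
    return a.pre < n.pre and a.post > n.post


def is_descendant(d: Node, n: Node) -> bool:  # lower right of n
    return d.pre > n.pre and d.post < n.post


def is_preceding(p: Node, n: Node) -> bool:   # lower left of n
    return p.pre < n.pre and p.post < n.post


def is_following(f: Node, n: Node) -> bool:   # upper right of n
    return f.pre > n.pre and f.post > n.post


# Element skeleton of the document in Fig. 1.
sec1 = Node("sec", [Node("st"), Node("p")])
sec2 = Node("sec", [Node("st"), Node("p"), Node("p")])
fm = Node("fm", [Node("title"), Node("author"), Node("year")])
bdy = Node("bdy", [sec1, sec2])
article = Node("article", [fm, bdy])
assign_ranks(article)
```

Here "upper left" and the other regions are read with pre on the horizontal axis and post on the vertical axis. No tree traversal is needed at query time, which is what makes the representation convenient to store in relational tables.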

In contrast to other path index structures for XML, XPath Accelerator efficiently 
supports path expressions that do not start at the document root. As explained in [19], 
all data are stored in a relational database. The Path Index (PI) allows the 
reconstruction of the document structure (thanks to the XPath Accelerator model). The 
Term Index (TI) is a traditional inverted file. The Element Index (IE) describes the 
content of each leaf node, the Attribute Index (AI) gives the values of attributes, and 
the Dictionary (DICT) allows the grouping of tags having the same meaning. 

3.2 The XFIRM Query Language 

XFIRM is based on a complete query language, allowing the expression of queries 
with simple keyword terms and/or with structural conditions [20]. In its more 
complex form, the language allows the expression of hierarchical conditions on document 
structure, and the element to be returned to the user can be specified (with the te: 
(target element) operator). For example, the following XFIRM queries: 

(i) // te: p [weather forecasting systems] 

(ii) // article [security] // te: sec ["facial recognition"] 

(iii) // te: article [Petri net] // sec [formal definition] AND sec [algorithm efficiency] 

(iv) // te: article [] // sec [search engines] 

respectively mean that (i) the user wants a paragraph about weather forecasting 
systems, (ii) a section about facial recognition in an article about security, (iii) an 
article about Petri nets containing a section giving a formal definition and another 
section talking about algorithm efficiency, and (iv) an article containing a section about 
search engines. 
When expressing content conditions, if any, the user can use simple keyword 
terms (or phrases), optionally preceded by + or - (which means that the term 
should or should not be in the results). Terms can also be connected with Boolean 
operators. Regarding the structure, the query syntax allows the user to formulate 
vague path expressions. For example, he/she can ask for “article [] // sec []” (he/she 
thus knows that article nodes have section nodes as descendants), without necessarily 
asking for a precise path, i.e. article/bdy/sec. Moreover, a tag dictionary is used in 
query processing. It is useful in case of heterogeneous collections (i.e. XML 
documents do not necessarily follow the same DTD) or in case of documents containing 
tags considered as equivalent, for example title and sub-title. 

3.3 Query Processing 

The approach we propose for dealing with queries containing content and structure 
conditions is based on relevance weights propagation. The query evaluation is carried 
out as follows: 

1. queries are decomposed into elementary sub-queries 

2. relevance values are assigned to leaf nodes 

3. relevance values are propagated through the document tree 

4. original queries are evaluated using the elementary sub-queries 

Query Decomposition 

Each XFIRM query can be decomposed into sub-queries $SQ_i$ as follows: 

$Q = //SQ_1 //SQ_2 // \ldots // te: SQ_j // \ldots // SQ_n$ (1) 

where te: indicates which element is the target element. Each sub-query $SQ_i$ can 
then be further decomposed into elementary sub-queries $ESQ_{i,j}$, possibly linked with 
boolean operators, of the form: 

$ESQ_{i,j} = tg[q]$ (2) 

where tg is a tag name and $q = (t_1, \ldots, t_n)$ is a set of keywords, i.e. a content condition. 
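
The decomposition can be sketched as a small parser. This is only an illustration under simplifying assumptions: sub-queries are separated by `//`, elementary sub-queries have the form `tag [content]`, boolean connectors between them appear right after a closing bracket, and the content condition is kept as a raw string. The function `decompose` and the dictionary layout are hypothetical, not part of XFIRM.

```python
import re

# One elementary sub-query: a tag name followed by a bracketed content condition.
ESQ_PATTERN = re.compile(r"\s*(\w+)\s*\[(.*?)\]\s*")
# A boolean connector located between two elementary sub-queries.
CONNECTOR = re.compile(r"(?<=\])\s*\b(AND|OR)\b\s*")


def decompose(query: str) -> list:
    """Split a query into sub-queries, each a list of (tag, content) pairs."""
    subqueries = []
    for part in query.split("//"):
        part = part.strip()
        if not part:
            continue
        is_target = part.startswith("te:")
        if is_target:
            part = part[len("te:"):]
        tokens = CONNECTOR.split(part)  # [esq, op, esq, op, ...]
        esqs = []
        for token in tokens[0::2]:
            match = ESQ_PATTERN.fullmatch(token)
            if match is None:
                raise ValueError(f"cannot parse elementary sub-query: {token!r}")
            esqs.append((match.group(1), match.group(2).strip()))
        subqueries.append(
            {"target": is_target, "esqs": esqs, "operators": tokens[1::2]}
        )
    return subqueries


parsed = decompose(
    "// te: article [Petri net] // sec [formal definition] AND sec [algorithm efficiency]"
)
```

For this query, `parsed` holds two sub-queries: the target sub-query with the single elementary sub-query `article [Petri net]`, and a second sub-query made of two `sec` elementary sub-queries linked by AND.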

Evaluating Leaf Nodes Relevance Values 

The first step in query processing is to evaluate the relevance value of leaf nodes ln 
according to the content conditions (if they exist). Let $q = (t_1, \ldots, t_n)$ be a content 
condition. Relevance values are evaluated with a similarity function called $RSV_m(q, ln)$, 
where m is an IR model. XFIRM allows many IR models to be implemented. 
As the purpose of this article is to evaluate the value of relevance values 
propagation, we choose the vector space model as reference. So: 

$RSV_m(q, ln) = \sum_{i=1} w_i^{q} * w_i^{ln}$, with $w_i^{q} = tf_i^{q} * ief_i$ and $w_i^{ln} = tf_i^{ln} * ief_i$ (3) 

where $tf_i$ is the frequency of term i in the query q or in the leaf node ln, and $ief_i$ is 
the inverse element frequency of term i, i.e. $\log(N/n + 1)$, where n is the number of 
leaf nodes containing i and N is the total number of leaf nodes. 
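
A minimal sketch of this leaf scoring follows. It assumes that the inverse element frequency is read as log(N/n + 1), that leaf nodes are given as lists of already tokenized terms, and that a term absent from every leaf contributes nothing; the function names are illustrative.

```python
import math
from collections import Counter


def ief(term: str, leaves: list) -> float:
    """Inverse element frequency, read here as log(N / n + 1) (an assumption)."""
    total = len(leaves)                                  # N
    containing = sum(1 for leaf in leaves if term in leaf)  # n
    if containing == 0:
        return 0.0
    return math.log(total / containing + 1)


def rsv(query_terms: list, leaf_terms: list, leaves: list) -> float:
    """Unnormalized vector-space score of one leaf node for a content condition."""
    tf_q = Counter(query_terms)
    tf_ln = Counter(leaf_terms)
    score = 0.0
    for term, freq_q in tf_q.items():
        weight = ief(term, leaves)
        score += (freq_q * weight) * (tf_ln[term] * weight)
    return score


leaves = [
    ["search", "engines"],
    ["google", "search", "engine"],
    ["internet", "growth"],
]
query = ["search", "engines"]
scores = [rsv(query, leaf, leaves) for leaf in leaves]
```

In this toy collection the first leaf matches both query terms, the second matches only the more common term, and the third matches none, so the scores decrease in that order.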
Elementary Sub-queries $ESQ_{i,j}$ Processing 

The result set $R_{i,j}$ of $ESQ_{i,j}$ is a set of pairs (node, relevance) defined as follows: 

$R_{i,j} = \{ (n, r_n) / n \in \{construct(tg)\} \text{ and } r_n = F_k(RSV_m(q, nf_k), dist(n, nf_k)) \}$ (4) 

where $r_n$ is the relevance weight of node n; the construct(tg) function creates the 
set of all nodes having tg as tag name; and the $F_k(RSV_m(q, nf_k), dist(n, nf_k))$ 
function propagates and aggregates the relevance values of the leaf nodes $nf_k$, 
descendants of node n, in order to form the relevance value of node n. 
This propagation is a function of the distance $dist(n, nf_k)$ which separates node n 
from leaf node $nf_k$ in the document tree (i.e. the number of arcs that are necessary 
to join n and $nf_k$). 
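
The propagation step can be sketched as follows. The sketch assumes one particular form for the propagation function, a sum of leaf scores each damped by alpha raised to the distance, which is an assumption made for illustration. It also assumes that leaf nodes are text nodes, so that the closest candidate element is at distance 1. The dictionaries `parent` and `tags` and the function name are hypothetical.

```python
from collections import defaultdict


def process_esq(tag: str, leaf_scores: dict, parent: dict, tags: dict,
                alpha: float = 0.9) -> dict:
    """Return {node: relevance} for every node named `tag` that has a scored leaf below it.

    leaf_scores maps leaf ids to their RSV, parent maps a node id to its
    parent id (None for the root), tags maps element ids to tag names.
    """
    result = defaultdict(float)
    for leaf, score in leaf_scores.items():
        if score == 0:
            continue
        node, dist = parent[leaf], 1
        while node is not None:
            if tags[node] == tag:
                # Assumed propagation: damp the leaf score by alpha ** distance.
                result[node] += alpha ** dist * score
            node, dist = parent[node], dist + 1
    return dict(result)


# article(1) > bdy(2) > sec(3) > p(4) > text(5), and sec(3) > p(6) > text(7)
parent = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 3, 7: 6}
tags = {1: "article", 2: "bdy", 3: "sec", 4: "p", 6: "p"}
leaf_scores = {5: 2.0, 7: 1.0}

sec_result = process_esq("sec", leaf_scores, parent, tags, alpha=0.5)
p_result = process_esq("p", leaf_scores, parent, tags, alpha=0.5)
```

Starting from the scored leaves and walking up, rather than enumerating all nodes named tg and walking down, touches only the parts of the tree that can receive a non-zero relevance value.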

Subqueries $SQ_i$ Processing 

Once each $ESQ_{i,j}$ has been processed, subqueries $SQ_i$ are evaluated with the 
commutative operators $\oplus_{AND}$ and $\oplus_{OR}$ defined below: 

Definition 1: Let $N = \{(n, r_n)\}$ and $M = \{(m, r_m)\}$ be two sets of pairs (node, 
relevance). 

$N \oplus_{AND} M = \{ (l, r_l) \ / \ l$ is the nearest common ancestor of m and n, or l = m 
(respectively n) if m (resp. n) is an ancestor of n (resp. m), $\forall m, n$ being in the same 
document, and $r_l = aggreg_{AND}(r_n, r_m, dist(l,n), dist(l,m)) \}$ (5) 

$N \oplus_{OR} M = \{ (l, r_l) \ / \ l = n \in N \text{ or } l = m \in M \text{ and } r_l = r_n \text{ or } r_m \}$ (6) 

where $aggreg_{AND}(r_n, r_m, dist(l,n), dist(l,m)) = r_l$ defines the way relevance values 
$r_n$ and $r_m$ of nodes n and m are aggregated in order to form a new relevance $r_l$. 

Let $R_i$ be the result set of $SQ_i$. Then: 

If $SQ_i = ESQ_{i,j}$, then $R_i = R_{i,j}$ (7) 

If $SQ_i = ESQ_{i,j}$ AND $ESQ_{i,k}$, then $R_i = R_{i,j} \oplus_{AND} R_{i,k}$ (8) 

If $SQ_i = ESQ_{i,j}$ OR $ESQ_{i,k}$, then $R_i = R_{i,j} \oplus_{OR} R_{i,k}$ (9) 
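
The AND operator of Definition 1 can be sketched as below. The aggregation function is passed in as a parameter. Two choices are assumptions of this sketch rather than statements from the paper: nodes from different documents are recognized by having no common ancestor, and when several pairs (n, m) lead to the same node l, the highest aggregated value is kept. All names are illustrative.

```python
def ancestors_with_distance(node, parent):
    """Yield (ancestor-or-self, number of arcs from node) up to the root."""
    dist = 0
    while node is not None:
        yield node, dist
        node, dist = parent[node], dist + 1


def combine_and(N: dict, M: dict, parent: dict, aggreg) -> dict:
    """Combine two {node: relevance} result sets at nearest common ancestors."""
    result = {}
    for n, r_n in N.items():
        dist_from_n = dict(ancestors_with_distance(n, parent))
        for m, r_m in M.items():
            for l, dist_lm in ancestors_with_distance(m, parent):
                if l in dist_from_n:  # first hit is the nearest common ancestor
                    r_l = aggreg(r_n, r_m, dist_from_n[l], dist_lm)
                    # Assumption: keep the best value when l is reached several times.
                    result[l] = max(result.get(l, 0.0), r_l)
                    break
    return result


# article(1) > bdy(2) > sec(3) > {p(4), p(6)}, and bdy(2) > sec(8) > p(9)
parent = {1: None, 2: 1, 3: 2, 4: 3, 6: 3, 8: 2, 9: 8}


def plain_sum(r_n, r_m, dist_ln, dist_lm):
    """Hypothetical aggregation that ignores the distances."""
    return r_n + r_m


siblings = combine_and({4: 1.0}, {6: 2.0}, parent, plain_sum)
nested = combine_and({3: 1.0}, {4: 0.5}, parent, plain_sum)
cousins = combine_and({4: 1.0}, {9: 1.0}, parent, plain_sum)
```

When one node is an ancestor of the other, its distance to l is 0, so an aggregation function that divides by the distance would need a guard for that case.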



Whole Queries Processing 

The result sets of sub-queries $SQ_i$ are then used to process whole queries. In each 
query, a target element is specified, as defined above: 

$Q = //SQ_1 //SQ_2 // \ldots // te: SQ_j // \ldots // SQ_n$ 

Thus, the aim in whole query processing is to propagate the relevance values 
of sub-queries $SQ_i$ to nodes belonging to the result set of the sub-query $SQ_j$, which 
defines the target element. This is obtained with the non-commutative operators 
$\nabla$ and $\Delta$ defined below: 

Definition 2: Let $R_i = \{(n, r_n)\}$ and $R_{i+1} = \{(m, r_m)\}$ be two sets of pairs (node, 
relevance). 

$R_i \nabla R_{i+1} = \{ (n, r_n) \ / \ n \in R_i$ is an ancestor of $m \in R_{i+1}$ and 
$r_n = prop\_agg(r_n, r_m, dist(m,n)) \}$ (10) 

$R_i \Delta R_{i+1} = \{ (n, r_n) \ / \ n \in R_i$ is a descendant of $m \in R_{i+1}$ and 
$r_n = prop\_agg(r_n, r_m, dist(m,n)) \}$ (11) 

where $prop\_agg(r_n, r_m, dist(m,n)) \rightarrow r_n$ aggregates the relevance 
weights of node m and of node n according to the distance that separates the two 
nodes, in order to obtain the new relevance weight of node n. 

The result set R of a query Q is then defined as follows: 

$R = R_j \nabla (R_{j+1} \nabla (R_{j+2} \nabla \ldots))$ (12) 

$R = R_j \Delta (R_{j-1} \Delta (R_{j-2} \Delta \ldots))$ 

In fact, this is equivalent to propagating the relevance values of result sets 
$R_{j+1}, \ldots, R_n$ and $R_1, \ldots, R_{j-1}$ respectively upwards and downwards in the document tree. 



4 Experiments and Results 

4.1 The SCAS Task in the INEX Initiative 

Evaluating the effectiveness of XML retrieval systems requires a test collection 
(XML documents, tasks/queries, and relevance judgments) where the relevance 
assessments are provided according to a relevance criterion that takes into account the 
imposed structural aspects [6]. The Initiative for the Evaluation of XML Retrieval 
(INEX) pursues this aim. The INEX collection, drawn from 21 IEEE Computer Society 
journals from 1995-2002, consists of 12,135 documents with extensive XML markup. 

Participants in the INEX SCAS task (Strict Content and Structure task) have to 
process CAS (Content and Structure) queries, which contain explicit references to the 
XML structure, and restrict the context of interest and/or the context of certain search 
concepts. An example of an INEX 2003 CAS query is given below. 



<inex_topic topic_id="64" query_type="CAS"> 

<title> //article[about(./,'hollerith')] // sec[about(./, 'DEHOMAG')] </title> 

<description> In articles discussing Herman Hollerith find sections that mention 
DEHOMAG </description> 

<narrative> Relevant sections deal with DEHOMAG in documents that discuss work or life 
of Herman Hollerith </narrative> 

<keywords> Hollerith, DEHOMAG, Deutsche Hollerith Maschinen Gesellschaft 
</keywords> 

</inex_topic> 



Fig. 5. Example of CAS query 
The INEX metric for evaluation is based on the traditional recall and precision 
measures. To obtain recall/precision figures, the two dimensions of relevance 
(exhaustivity and specificity) need to be quantised onto a single relevance value. 
Quantisation functions for two user standpoints were used: (i) a “strict” quantisation, 
to evaluate whether a given retrieval approach is capable of retrieving highly exhaustive 
and highly specific document components, and (ii) a “generalised” quantisation, 
to credit document components according to their degree of relevance. 

Some Approaches 

In INEX 2003, most of the approaches used IR models to answer the INEX tasks, 
which shows the increased interest of the IR community in XML retrieval. 

Some approaches used a fetch and browse strategy [21, 16], which did not give results 
as good as expected. The Queensland University of Technology used a filtering 
method to find the most specific information units [8]. The vector space model was 
adapted in [14], using six different indexes for terms (article, section, paragraph, 
abstract, ...). Finally, language models were used in [2, 13] and [23]. The last of these 
obtained the best performance overall, using one language model per element. 

In the following, we present the results of the experiments we conducted in the 
INEX collection in order to evaluate several possible implementations of our model. 

4.2 Various Propagation Functions 

Five propagation functions have been evaluated. 

$F_k(RSV_m(q, nf_k), dist(n, nf_k))$ (4) is set to: 

$F_k(RSV_m(q, nf_k), dist(n, nf_k)) = \sum_{k=1..n} \alpha^{dist(n, nf_k)} * RSV_m(q, nf_k)$ (13) 

$aggreg_{AND}(r_n, r_m, dist(l,n), dist(l,m))$ (5) is either set to: 

$aggreg_{AND}(r_n, r_m, dist(l,n), dist(l,m)) = \frac{r_n}{dist(l,n)} + \frac{r_m}{dist(m,l)}$ (14) 

$aggreg_{AND}(r_n, r_m, dist(l,n), dist(l,m)) = [\ldots]$ (15) 

And finally, $prop\_agg(dist(m,n), r_n, r_m)$ (10) is either set to: 

$prop\_agg(dist(m,n), r_n, r_m) = \frac{r_n + r_m}{dist(m,n)}$ (16) 

$prop\_agg(dist(m,n), r_n, r_m) = [\ldots] * r_m + r_n$ (17) 

where $\alpha \in ]0..1]$ is a parameter used to adjust the importance of the distance 
between nodes in the different functions and dist(x,y) is the distance which separates 
node x from node y in the document tree. 

4.3 Implementation Issues 

The transformation of INEX CAS queries to XFIRM queries was fairly easy. Table 1 
gives some correspondences: 
Table 1. Transformation of INEX topics into XFIRM queries 

INEX topic:  //article[about(., 'clustering + distributed') and about(.//sec, 'java')] 
XFIRM query: // te: article [clustering + distributed] // sec [java] 

INEX topic:  //article[about(./sec, '"e-commerce"')] // abs[about(., 'trust authentication')] 
XFIRM query: //article [] AND sec ["e-commerce"] // te: abs [trust authentication] 

INEX topic:  //article[(.//yr='2000' OR .//yr='1999') AND about(., '"intelligent transportation system"')] // sec[about(., 'automation +vehicle')] 
XFIRM query: //article ["intelligent transportation system"] // te: sec [automation + vehicle] 



During $ESQ_{i,j}$ processing, the most relevant leaf nodes are found, and for each of 
these leaf nodes, XFIRM looks for ancestors. In order to keep the response time of the 
system acceptable, the propagation is stopped when 1500 “correct” ancestors are found 
(i.e. ancestors having a correct tag name). 

When an INEX topic contains a condition on the article publication date (as is the 
case in the last query of Table 1), this condition is not translated into the XFIRM 
language, because propagation with a very common term (like a year) takes too long. To 
solve this issue, queries are processed by XFIRM without this condition, and results 
are then filtered on the article publication date. 

Finally, the Dictionary index is used to find equivalent tags. For example, 
according to INEX guidelines, sec (section) nodes are equivalent to ss1, ss2 and ss3. 

4.4 Runs 



We evaluated 5 runs, combining the different functions: 



Run name             Prop. functions   α                 Topic fields 
xfirm.TK.alpha=0.7   (13) (15) (17)    0.7               Title+Keywords 
xfirm.TK.alpha=0.9   (13) (15) (17)    0.9               Title+Keywords 
xfirm.TK.alpha=1     (13) (15) (17)    1                 Title+Keywords 
xfirm.TK.mix         (13) (14) (16)    0.9 for (13)      Title+Keywords 
xfirm.T.mix          (13) (14) (16)    0.9 for (13)      Title 



In addition, these runs were compared to the best run we performed last year in the 
INEX SCAS task with our fetch and browse method: Mercure2.pos_cas_ti [21]. 

4.5 Analysis of the Results 

Table 2 shows the average precision (for strict and generalized quantization) obtained 
by each run over 30 queries. The associated recall-precision curves for strict quantiza- 
tion are plotted in Figure 6. 

The first point to be noticed is the relatively high precision for all runs. Table 3 
shows our runs if they were integrated in the official INEX results for strict quantiza- 
tion. Best results were obtained by the University of Amsterdam, using language 
Table 2. Average precision for our 6 runs 

Run                   Average precision          Average precision 
                      (strict quantization)      (generalized quantization) 
Xfirm.TK.alpha=0.7    0.2346                     0.2253 
Xfirm.TK.alpha=0.9    0.2766                     0.2279 
Xfirm.TK.alpha=1      0.2783                     0.2257 
Xfirm.TK.mix          0.2898                     0.2300 
Xfirm.T.mix           0.2675                     0.2276 
Mercure2.pos_cas_ti   0.1620                     0.1637 




Fig. 6. Average/precision curves for strict quantization 



models [23]. Most of our runs would have been ranked between the second and third 
position, before the Queensland University of Technology [16], who processed que- 
ries with a fetch and browse approach. 

The propagation method we used significantly improves on the results we 
obtained with our “fetch and browse” method (run Mercure2.pos_cas_ti). This is not 
really surprising, because the XFIRM model is able to process all the content 
conditions, whereas the run performed with the Mercure system only verifies that 
conditions on the target element are respected. Moreover, the processing time for each 
query is of course lower (because, thanks to the index structure, the XFIRM model does 
not have to browse each exhaustive document to find the specific elements). The distance 
between nodes seems to be a useful parameter for the propagation functions. It can be 
noticed that the Xfirm.TK.mix run, where distance is considered, obtains a better average 
precision than the Xfirm.TK.alpha=1 run, where the distance has no importance. 
However, the three runs evaluated with different values of α (Xfirm.TK.alpha=0.7, 
Xfirm.TK.alpha=0.9, Xfirm.TK.alpha=1) show that the distance should be 
considered carefully. Indeed, when relevance values are down-weighted too strongly by 
the distance, performance decreases. 

Finally, the use of the title and keywords fields of INEX topics increases the average 
precision of the Xfirm.TK.mix run compared to the Xfirm.T.mix run, even if it 
decreases the precision for some particular queries. 

So, the relevance propagation method seems to give good results, using all leaf 
nodes as starting points for the propagation. Our methods have to be explored on other 
topics/collections to confirm this performance. Moreover, the IR model (i.e. the 
vector space model) used for relevance value calculation needs more investigation, 
since the formula used for these experiments is not normalized. Further experiments 
will be necessary, for example with the bm25 formula [18]. 



Table 3. Ranking of official INEX submissions and of our runs for strict quantization. Please 
note that most of them are also in the “top ten” for generalized quantization 

Rank   Avg precision   Organisation                     Run ID 
1      0.3182          U. of Amsterdam                  UamsI03-SCAS-MixedScore 
2      0.2987          U. of Amsterdam                  UamsI03-SCAS-ElementScore 
       0.2898                                           Xfirm.TK.mix 
       0.2783                                           Xfirm.TK.alpha=1 
       0.2766                                           Xfirm.TK.alpha=0.9 
       0.2675                                           Xfirm.T.mix 
3      0.2601          Queensland Univ. of Technology   CASQuery_1 
4      0.2476          University of Twente and CWI     LMM-ComponentRetrieval-SCAS 
5      0.2458          IBM, Haifa Research lab          SCAS-TK-With-Clustering 
6      0.2448          Universität Duisburg-Essen       Scas03-way1-alias 
7      0.2437          RMIT University                  RMIT_SCAS_1 
8      0.2419          RMIT University                  RMIT_SCAS_2 
9      0.2405          IBM, Haifa Research lab          SCAS-TK-With-No-Clustering 
10     0.2352          RMIT University                  RMIT_SCAS_3 
       0.2346                                           Xfirm.TK.alpha=0.7 
24     0.1641          IRIT                             Mercure2.pos_cas_ti 



5 Conclusion 

We have presented here an approach for XML content and structure-oriented search 
that addresses the search issue from an IR viewpoint. We have described the XFIRM 
model and a relevance values propagation method that allows the ranking of 
information units according to their degree of relevance. This propagation method is 
based on relevance values calculation for each leaf node (using the vector space model) 
and then on propagation functions using the distance between nodes to aggregate the 
relevance values. The XFIRM model decomposes each query into elementary 
sub-queries to process them and then recomposes the original query to respect any 
hierarchical conditions. 

This method achieves good results on the INEX topics. Further experiments 
should be carried out to evaluate the impact of the IR model used for leaf node 
relevance values calculation and to confirm results on other topics/collections. 
Abstract. To date, attempts to apply syntactic information in the 
dominant document-based retrieval model have led to little practical 
improvement, mainly due to the problems associated with the integration 
of this kind of information into the model. In this article we propose the 
use of a locality-based retrieval model for reranking, which deals with 
syntactic linguistic variation through similarity measures based on the 
distance between words. We study two approaches whose effectiveness 
has been evaluated on the CLEF corpus of Spanish documents. 



1 Introduction 

Syntactic processing has been applied repeatedly in the field of Information 
Retrieval (IR) for dealing with the syntactic variation present in natural language 
texts [14, 8, 11], although its use in languages other than English has not as yet 
been studied in depth. In order to apply this kind of technique, it is necessary 
to perform some kind of parsing process, which itself requires the definition of 
a suitable grammar. For languages lacking advanced linguistics resources, such 
as wide-coverage grammars or treebanks, the application of these techniques is 
a real challenge. In the case of Spanish, for example, only a few IR experiments 
involving syntax have been performed [1, 18, 20, 19]. Even when reliable syntactic 
information can be extracted from texts, the issue that arises is how to integrate 
it into an IR system. The prevalent approaches consist of a weighted combination 
of multi-word terms - in the form of head-modifier pairs - and single-word terms 
- in the form of word stems. Unfortunately, the use of multi-word terms has not 
proven to be effective enough, regardless of whether they have been obtained by 
means of syntactic or statistical methods, mainly due to the difficulty of solving 
the overweighting of complex terms with respect to simple terms [13]. 

In this context, pseudo-syntactic approaches based on the distance between 
terms arise as a practical alternative that avoids the problems listed above as a 
result of not needing any grammar or parser, and because the information about 
the occurrence of individual words can be integrated in a consistent way with 
the information about proximity to other terms, which in turn is often related 
with the existence of syntactic relations between such terms. 
In this work we propose the use of a locality-based retrieval model, based on 
a similarity measure computed as a function of the distance between terms, as a 
complement to classic IR techniques based on indexing single-word terms, with 
the aim of increasing the precision of the documents retrieved by the system in 
the case of Spanish. 

The rest of the article is organized as follows. Section 2 introduces the locality- 
based retrieval model and our first approach for integrating it into our system; 
the experimental results of this first proposal are shown in Section 3. A second 
approach, based on data fusion, is described in Section 4, and its results are 
discussed in Section 5. Finally, our conclusions and future work are presented in 
Section 6. 

2 Locality-Based IR 

2.1 The Retrieval Model 

In the document-based retrieval model prevalent nowadays, an IR system 
retrieves a list of documents ranked according to their degree of relevance 
with respect to the information need of the user. In contrast, a locality-based 
IR system goes one step further, and looks for the concrete locations in the 
documents which are relevant to such a need. Passage retrieval [10] could be 
considered as an intermediate point between these two models, since its aim is to 
retrieve portions of documents (called passages) relevant to the user. However, 
passage retrieval is closer to document-based than to locality-based retrieval: 
once the original documents have been split into passages they are ranked using 
traditional similarity measures. In this case, the main difficulty comes from 
specifying what a passage is, including considerations about size and overlapping 
factors, and how they can be identified. 

In contrast, the locality-based model considers the collection to be indexed 
not as a set of documents, but as a sequence of words where each occurrence 
of a query term has an influence on the surrounding terms. Such influences 
are additive, thus, the contributions of different occurrences of query terms are 
summed, yielding a similarity measure. As a result, those areas of the texts with 
a higher density of query terms, or with important query terms, show peaks in 
the resulting graph, highlighting those positions of the text which are potentially 
relevant with respect to the query. A graphical representation of this process is 
shown in Fig. 1. It is worth noting that relevant portions are identified without 
the need to perform any kind of splitting in the documents, as is done in passage 
retrieval. 

Next, we describe the original proposal of de Kretser and Moffat for the 
locality-based model [5,6]. 

2.2 Computing the Similarity Measure 

In the locality-based model the similarity measure only needs to be computed
for those positions of the text in which query terms occur, a characteristic which
makes its application possible in practical environments due to its computational
cost being relatively low.

Fig. 1. Computing the similarity measure in a locality-based model: (a) positions where
query terms occur and their regions of influence; (b) the resultant similarity curve

The contribution to the similarity graph of a given query term is determined
by a similarity contribution function $c_t$ defined according to the following
parameters [5]:

- The shape of the function, which is the same for all terms.

- The maximum height $h_t$ of the function, which occurs in the position of the
query term.

- The spread $s_t$ of the function, that is, the scope of its influence.

- The distance, in words, with respect to other surrounding words, $d = |x - l|$,
where $l$ is the position of the query term and $x$ is the position of the word
in the text where we want to compute the similarity score.



Several function shapes are described in [5], but we only show here those
with which we obtained better results in Spanish. They are the triangle (tri)
and the circle (cir) function, defined by equations 1 and 2, respectively, and
whose graphical representation is shown in Fig. 2:

$$c_t(x, l) = h_t \left(1 - \frac{d}{s_t}\right) \qquad (1)$$

$$c_t(x, l) = h_t \sqrt{1 - \left(\frac{d}{s_t}\right)^2} \qquad (2)$$

with $c_t(x, l) = 0$ when $|x - l| > s_t$.
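
The two shapes can be illustrated with a short Python sketch. It is only a minimal illustration of equations 1 and 2, not the implementation used in the experiments; the function name `contribution` and its argument names are assumptions introduced here, and a strictly positive spread is assumed.

```python
import math


def contribution(shape: str, height: float, spread: float, distance: float) -> float:
    """Similarity contribution of a query term occurrence at `distance` words.

    `shape` is "tri" (equation 1) or "cir" (equation 2); the names are
    hypothetical and `spread` is assumed to be strictly positive.
    """
    if distance > spread:
        return 0.0  # outside the region of influence of the term
    ratio = distance / spread
    if shape == "tri":
        return height * (1.0 - ratio)
    if shape == "cir":
        return height * math.sqrt(1.0 - ratio * ratio)
    raise ValueError(f"unknown shape: {shape}")
```

Both shapes reach the maximum height at distance 0 and vanish beyond the spread; the circle decays more slowly near the term occurrence than the triangle does.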

Fig. 2. Shapes of the similarity contribution function $c_t$

The height $h_t$ of a query term $t$ is defined as an inverse function of its
frequency in the collection:

$$h_t = f_{q,t} \cdot \log_e(N / f_t) \qquad (3)$$

where $N$ is the total number of terms in the collection, $f_t$ is the number of times
term $t$ appears in the collection, and $f_{q,t}$ is the within-query frequency of the
term.

On the other hand, the spread $s_t$ of the influence of a term $t$ is also defined as
an inverse function of its frequency in the collection, but normalized according
to the average term frequency:

$$s_t = \frac{n}{N} \cdot \frac{N}{f_t} = \frac{n}{f_t} \qquad (4)$$

where $n$ is the number of unique terms in the collection, that is, the size of the
vocabulary.
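
As a rough numeric illustration of the height and the spread, the following Python sketch may help. It assumes the spread takes the form n / f_t (the inverse term frequency N / f_t normalized by the average term frequency N / n); the function names and the sample figures are hypothetical and do not come from the collection used in the experiments.

```python
import math


def height(f_qt: int, total_terms: int, f_t: int) -> float:
    """Height of a query term: within-query frequency times log_e(N / f_t)."""
    return f_qt * math.log(total_terms / f_t)


def spread(vocabulary_size: int, f_t: int) -> float:
    """Spread of a query term, assuming the form n / f_t."""
    return vocabulary_size / f_t


# Hypothetical collection: 1,000,000 term occurrences, 50,000 unique terms.
rare = (height(1, 1_000_000, 100), spread(50_000, 100))
common = (height(1, 1_000_000, 10_000), spread(50_000, 10_000))
```

Under these assumed figures the rare term gets both a higher peak and a wider region of influence than the common one, which is the intended behaviour of an inverse-frequency definition.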

Once these parameters have been fixed, the similarity score assigned to a
location $x$ of the document in which a term of the query $Q$ can be found is
calculated as:

$$C_Q(x) = \sum_{t \in Q} \; \sum_{\substack{l \in I_t \\ |l - x| \le s_t \\ term(x) \ne term(l)}} c_t(x, l) \qquad (5)$$

where $I_t$ is the set of word locations at which a term $t$ of the query $Q$ occurs,
and where $term(w)$ represents the term associated to the location $w$. In other
words, the degree of similarity or relevance associated with a given location is
the sum of all the influences exerted by the rest of query terms within whose
spread the term is located, excepting other occurrences of the same term that
exist at the location examined [6].
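
The computation of the score at each query term location can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: the text is a plain list of terms, each query term comes with a precomputed (height, spread) pair, and the names `location_scores` and `contribution` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def contribution(shape: str, h: float, s: float, d: float) -> float:
    """Triangle or circle contribution; zero beyond the spread."""
    if d > s:
        return 0.0
    if shape == "tri":
        return h * (1 - d / s)
    return h * math.sqrt(1 - (d / s) ** 2)  # "cir"


def location_scores(
    words: List[str],
    params: Dict[str, Tuple[float, float]],
    shape: str = "tri",
) -> Dict[int, float]:
    """Score every position holding a query term.

    `params` maps each query term to its (height, spread) pair.
    """
    # Positions at which each query term occurs.
    occurrences = {t: [i for i, w in enumerate(words) if w == t] for t in params}
    scores: Dict[int, float] = {}
    for x, w in enumerate(words):
        if w not in params:
            continue  # scores are only computed where a query term occurs
        total = 0.0
        for t, (h, s) in params.items():
            if t == w:
                continue  # occurrences of the same term do not contribute
            for l in occurrences[t]:
                total += contribution(shape, h, s, abs(x - l))
        scores[x] = total
    return scores


text = ["a", "x", "b", "y", "a"]
scores = location_scores(text, {"a": (2.0, 4.0), "b": (1.0, 4.0)})
```

In this toy text the occurrence of "b" lies within the spread of both occurrences of "a", so it accumulates two contributions, while each "a" only receives the single contribution of "b".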

Finally, the relevance score assigned to a document D is given as a function of
the similarities corresponding to occurrences of query terms that this document
contains. This point is discussed in detail below.



2.3 Adaptations of the Model 

The locality-based model not only identifies the relevant documents but also 
the relevant locations they contain, allowing us to work at a more detailed level 
than classical IR techniques. Thus, we have opted for using this model in our 
experiments. Nevertheless, before doing so, the model had to be adapted to our
needs, which makes our approach different from the original proposal of the
model [5,6].

The approach we have chosen for integrating distance-based similarity in our 
IR system consists of postprocessing the documents obtained by a document- 
based retrieval system. This initial set of documents is obtained through a base 
IR system - we name it lem - which employs content-word lemmas (nouns, 
adjectives and verbs) as index terms. This list of documents returned by lem is 
then processed using the locality-based model, taking the final ranking obtained 
using distance-based similarity as the final output to be returned to the user. 

It should be pointed out that the parameters of height, $h_t$, and spread, $s_t$,
employed for the reranking are calculated according to the global parameters of
the collection, not according to the parameters which are local to the subset of
documents returned, in order to avoid the correlation-derived problems it would
introduce.¹

Another aspect in which our approach differs from the original model is the 
employment of lemmatization, instead of stemming, for conflating queries and 
documents. We have made this choice due to the encouraging results previously 
obtained with such an approach, with respect to stemming, in the case of 
Spanish [20]. 

The third point of difference corresponds to the algorithm for calculating the
relevance of a document, obtained from the similarity scores of its query term
occurrences. Instead of the original iterative algorithm [5], our approach defines
the similarity score $sim(D, Q)$ of a document $D$ with respect to a query $Q$ as
the sum of all the similarity scores of the query term occurrences it contains:

$$sim(D, Q) = \sum_{\substack{x \in D \\ term(x) \in Q}} C_Q(x) \qquad (6)$$
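
A document score of this kind can be sketched in Python as a plain sum over the scored locations of each document. The sketch assumes that every document occupies a contiguous range of word positions in the collection; the function name `rank_documents`, the `bounds` mapping and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_documents(
    scores: Dict[int, float],
    bounds: Dict[str, Tuple[int, int]],
) -> List[Tuple[str, float]]:
    """Sum location scores per document and sort by decreasing similarity.

    `scores` maps word positions holding a query term to their score;
    `bounds` maps a document id to its [start, end) range of positions.
    """
    sims = {
        doc: sum(v for x, v in scores.items() if start <= x < end)
        for doc, (start, end) in bounds.items()
    }
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)


ranking = rank_documents(
    {0: 0.5, 2: 2.0, 7: 1.0},
    {"d1": (0, 5), "d2": (5, 10)},
)
```

Because the score is a plain sum, a document with many moderately scored locations can outrank one with a single strong peak.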



3 Experimental Results Using Distances 

Our approach has been tested using the Spanish monolingual corpus of the 2001
and 2002 CLEF editions [15], composed of 215,738 news reports provided by
EFE, a Spanish news agency. The 100 queries employed, from 41 to 140, consist
of three fields: a brief title statement, a one-sentence description, and a more
complex narrative specifying the relevance assessment criteria.

As mentioned in Sect. 2.3, the initial set of documents to be reranked is 
obtained through the indexing of content word lemmas (lem). For this purpose, 
the documents were indexed with the vector-based engine smart [3], using the 
atn-ntc weighting scheme. In order to improve the performance of the whole 
system, we have tried to obtain the best possible starting set of documents by 
applying pseudo-relevance feedback (blind-query expansion) adopting Rocchio’s 
approach [16]: 

$$Q_1 = \alpha Q_0 + \beta \sum_{k=1}^{n_1} \frac{R_k}{n_1} - \gamma \sum_{k=1}^{n_2} \frac{S_k}{n_2} \qquad (7)$$

where $Q_1$ is the new query vector, $Q_0$ is the vector of the initial query, $R_k$ is
the vector of relevant document $k$, $S_k$ is the vector of non-relevant document
$k$, $n_1$ is the number of relevant documents, $n_2$ is the number of non-relevant
documents, and $\alpha$, $\beta$ and $\gamma$ are, respectively, the parameters that control the
relative contributions of the original query, relevant documents, and non-relevant
documents. Our system expands the initial query automatically with the best 10
terms of the 5 top ranked documents, using $\alpha = 1.40$, $\beta = 0.10$ and $\gamma = 0$.

¹ For example, the parameter $f_t$, corresponding to the number of occurrences of a
term $t$, is the number of occurrences of $t$ in the entire collection, not the number of
occurrences of $t$ in the set of documents to be reranked.

Table 1. Reranking based on distances

                                 short queries                  long queries
                             stm    lem    tri    cir      stm    lem    tri    cir
Documents                    99k    99k    99k    99k      99k    99k    99k    99k
Relevant (5548 expected)    5086   5207   5207   5207     5208   5234   5234   5234
Non-interpolated precision .5210  .5235  .4473  .4464    .5638  .5648  .4802  .4703
Document precision         .5502  .5814  .5154  .5188    .5925  .6038  .5366  .5376
R-precision                .4952  .4978  .4438  .4453    .5316  .5335  .4574  .4490
Precision at .00 recall    .8426  .8260  .8402  .8394    .9028  .8788  .8771  .8639
Precision at .10 recall    .7294  .7431  .7551  .7533    .7910  .7989  .8167  .8022
Precision at .20 recall    .6746  .6936  .6550  .6624    .7326  .7420  .7070  .6909
Precision at .30 recall    .6135  .6380  .5764  .5806    .6763  .6887  .6066  .5996
Precision at .40 recall    .5812  .5900  .5045  .5052    .6401  .6499  .5417  .5314
Precision at .50 recall    .5470  .5520  .4496  .4515    .5975  .6058  .4894  .4819
Precision at .60 recall    .5078  .5099  .3882  .3850    .5452  .5502  .4184  .4045
Precision at .70 recall    .4518  .4498  .3360  .3340    .4816  .4816  .3654  .3547
Precision at .80 recall    .3882  .3796  .2750  .2692    .4056  .4022  .3042  .2929
Precision at .90 recall    .3044  .2923  .1933  .1917    .3356  .3150  .2023  .1944
Precision at 1.0 recall    .1897  .1756  .1031  .1014    .2054  .1918  .1062  .1000
Precision at 5 docs        .6182  .6182  .6141  .6121    .6808  .6747  .6667  .6606
Precision at 10 docs       .5717  .5758  .5596  .5596    .6182  .6202  .5929  .5869
Precision at 15 docs       .5279  .5380  .5111  .5192    .5670  .5798  .5441  .5394
Precision at 20 docs       .4965  .5071  .4803  .4818    .5338  .5556  .5081  .5056
Precision at 30 docs       .4434  .4582  .4259  .4229    .4822  .5030  .4545  .4566
Precision at 100 docs      .2935  .3016  .2691  .2696    .3119  .3171  .2811  .2812
Precision at 200 docs      .1937  .2002  .1863  .1875    .2053  .2060  .1926  .1932
Precision at 500 docs      .0945  .0981  .0964  .0964    .0981  .0985  .0979  .0982
Precision at 1000 docs     .0514  .0526  .0526  .0526    .0526  .0529  .0529  .0529
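
The reweighting step of Rocchio's formula can be sketched in Python over sparse term-weight vectors. This is a minimal illustration, not the SMART implementation: vectors are assumed to be dictionaries from terms to weights, the selection of the best expansion terms is left out, and the function name `rocchio` is hypothetical. The default parameter values are those reported in the text.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def rocchio(
    q0: Vector,
    relevant: List[Vector],
    nonrelevant: List[Vector],
    alpha: float = 1.40,
    beta: float = 0.10,
    gamma: float = 0.0,
) -> Vector:
    """Return the new query vector Q1 from Q0 and the feedback documents."""
    q1: Dict[str, float] = defaultdict(float)
    for term, weight in q0.items():
        q1[term] += alpha * weight
    for doc in relevant:  # average of the relevant document vectors
        for term, weight in doc.items():
            q1[term] += beta * weight / len(relevant)
    for doc in nonrelevant:  # average of the non-relevant document vectors
        for term, weight in doc.items():
            q1[term] -= gamma * weight / len(nonrelevant)
    return dict(q1)


expanded = rocchio(
    {"a": 1.0},
    [{"a": 1.0, "b": 2.0}, {"b": 2.0}],
    [],
    alpha=1.0,
    beta=0.5,
)
```

In blind feedback the top ranked documents are simply assumed to be relevant, so the list of non-relevant documents is empty and the value of gamma has no effect.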

It should be pointed out that the distance-based reranking process is
performed according to the terms of the original query, without taking into
account the terms added during the feedback. This is because there is no
guarantee that these terms are syntactically related to the original query
terms, since they only co-occur in the documents with such terms.

Two series of experiments have been carried out. Firstly, employing queries 
obtained from the title and description fields - short queries - and, secondly, 
employing queries obtained from the three fields, that is title, description and 
narrative - long queries. It should be noted that in the case of long queries,
the terms extracted from the title field are given double relevance with respect
to description and narrative, since the former summarizes the basic semantics of
the query.

The results obtained are shown in Table 1. The first column of each group
shows the results obtained through a standard approach based on stemming
(stm), also using pseudo-relevance feedback; the second column contains the
results of the indexing of lemmas (lem) before the reranking, our baseline; the
two other columns show the results obtained after reranking lem by means of
distances employing a triangle (tri) and circle (cir) function.

The performance of the system is measured using the parameters contained 
in each row: number of documents retrieved, number of relevant documents 
retrieved (5548 expected), average precision (non-interpolated) for all relevant 
documents (averaged over queries), average document precision for all relevant 
documents (averaged over relevant documents), R-precision, precision at 11 
standard levels of recall, and precision at N documents retrieved. For each 
parameter we have marked in boldface those values where there is an 
improvement with respect to the baseline lem. 

As these results show, reranking through distances has caused a general 
drop in performance, except for low recall levels, where results are similar or 
sometimes even better. We can therefore conclude that this first approach is of 
little practical interest. 

4 Data Fusion Through Intersection 

4.1 Analysis of Results 

Since the set of documents retrieved by the system is the same, the drop in 
performance in this first approach can only be caused by a worse ranking of 
the results because of the application of the distance-based model, and for this 
reason we decided to analyze the changes in the distribution of relevant and 
non-relevant documents in the K top retrieved documents. The results obtained 
in the case of using short queries and the triangle function (tri) are shown in 
Table 2. Changes in the type of query, short or long, or in the shape of the
function, triangle or circle, have little effect on these results and the conclusions
that can be inferred from them.

Each row contains the results obtained when comparing the K top documents
retrieved by lem (set of results L), with those K top documents retrieved
after their reranking using distances (set of results D). The columns show
the results obtained for each of the parameters considered: average number of
new relevant documents obtained through distances ($D \setminus L$), average number
of relevant documents lost using distances ($L \setminus D$), average number of relevant
documents preserved ($L \cap D$), overlap coefficient for relevant documents ($R_{over}$),
precision of lem at K top documents ($Pr(L)$), precision at K top documents
after reranking through distances ($Pr(D)$), precision for the documents common
to both approaches in their K top documents ($Pr(L \cap D)$). The right-hand side of
the table shows their equivalents for the case of non-relevant documents: average
number of non-relevant documents added, lost and preserved, together with their
degree of overlap.

Table 2. Document distribution (short queries - triangle function)

                           relevant docs.                              non-relevant docs.
   K     D\L     L\D     L∩D  R_over   Pr(L)   Pr(D) Pr(L∩D)       D\L     L\D     L∩D  N_over
   5    1.60    1.62    1.44    0.47    0.61    0.61    0.77      1.54    1.52    0.42    0.22
  10    2.73    2.89    2.81    0.50    0.57    0.55    0.68      3.14    2.98    1.32    0.30
  15    3.37    3.77    4.22    0.54    0.53    0.51    0.65      5.13    4.73    2.28    0.32
  20    3.92    4.44    5.59    0.57    0.50    0.48    0.63      7.24    6.72    3.25    0.32
  30    4.66    5.62    7.99    0.61    0.45    0.42    0.59     11.84   10.88    5.51    0.33
  50    5.95    9.16   20.69    0.73    0.60    0.53    0.42     45.04   41.83   28.32    0.39
 100    5.10    7.84   31.78    0.83    0.40    0.37    0.30     88.13   85.39   74.99    0.46
 200    1.99    2.82   45.72    0.95    0.24    0.24    0.14    164.28  163.45  288.01    0.64

Several important facts can be observed in these figures. Firstly, that the 
number of relevant documents retrieved by both approaches in their K top 
documents is very similar - a little smaller for distances - , as can be inferred 
from the number of incoming and outgoing relevant documents, and from the 
precisions at the top K documents of both approaches. This confirms that the 
problem has its origin in a bad reranking of the results. 

The second point we need to point out refers to the overlap coefficients of
both relevant ($R_{over}$) and non-relevant ($N_{over}$) documents. These coefficients,
defined by Lee in [12], show the degree of overlap among relevant and
non-relevant documents in two retrieval results. For two runs $run_1$ and $run_2$, they
are defined as follows:

$$R_{over} = \frac{2\,|Rel(run_1) \cap Rel(run_2)|}{|Rel(run_1)| + |Rel(run_2)|}$$

$$N_{over} = \frac{2\,|Nonrel(run_1) \cap Nonrel(run_2)|}{|Nonrel(run_1)| + |Nonrel(run_2)|}$$

where $Rel(X)$ and $Nonrel(X)$ represent, respectively, the set of relevant and
non-relevant documents retrieved by the run X.
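
Both coefficients have the same form and can be computed with a single helper. The following Python sketch assumes each run is given as a set of document identifiers already restricted to relevant (or non-relevant) documents; the function name `overlap`, the convention of returning 0 for two empty sets, and the sample identifiers are assumptions made for the illustration.

```python
from typing import Set


def overlap(run_1: Set[str], run_2: Set[str]) -> float:
    """Overlap coefficient 2|A ∩ B| / (|A| + |B|) between two document sets."""
    if not run_1 and not run_2:
        return 0.0  # assumed convention when both sets are empty
    return 2 * len(run_1 & run_2) / (len(run_1) + len(run_2))


# Hypothetical relevant documents in the top K of two runs.
r_over = overlap({"d1", "d2", "d3"}, {"d2", "d3", "d4"})
```

The coefficient is 1 when both runs return exactly the same set and 0 when they share no document.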

It can be seen in Table 2 that the overlap factor among relevant documents
is much higher than among non-relevant documents. Therefore, it obeys the
unequal overlap property [12], since both approaches return a similar set of
relevant documents, but a different set of non-relevant documents. This is a
good indicator of the effectiveness of fusion of both runs.

Finally, and also related to the previous point, the figures show that
the precision for the documents common to both approaches in their K top
documents ($Pr(L \cap D)$) is higher than the corresponding precisions for lemmas
($Pr(L)$) and distances ($Pr(D)$); that is, the probability of a document being
relevant is higher when it is retrieved by both approaches. In other words, the
more runs a document is retrieved by, the higher the rank that should be assigned
to the document [17].

According to these observations, we decided to take a new approach for 
reranking, this time through data fusion, by combining the results obtained 
initially with the indexing of lemmas with the results obtained when they are 
reranked through distances. Next, we describe this approach.

4.2 Description of the Algorithm

Data fusion is a technique of combination of evidences that consists of combining
the results retrieved by different representations of queries or documents, or by
different retrieval techniques [7, 12, 4].

In our data fusion approach, we have opted for using a boolean criterion 
instead of combining scores based on similarities [7, 12] or ranks [12]. 

Once the value K is set, the documents are retrieved in the following order: 

1. First, the documents contained in the intersection of the top K documents
retrieved by both approaches: $L_K \cap D_K$. Our aim is to increase the precision
of the top documents retrieved.

2. Next, the documents retrieved in the top K documents by only one of
the approaches: $(L_K \cup D_K) \setminus (L_K \cap D_K)$. Our aim is to add to the top
of the ranking those relevant documents retrieved only by the distance-
based approach at its top, but without harming the ranking of the relevant
documents retrieved by the indexing of lemmas.

3. Finally, the rest of documents retrieved using lem: $L \setminus (L_K \cup D_K)$.

where $L$ is the set of documents retrieved by lem, $L_K$ is the set of the top K
documents retrieved by lem, and $D_K$ is the set of the top K documents retrieved
by applying distances.

With respect to the internal ranking of the results, we will take the ranking
obtained with lem as reference, because of its better behavior. In this way, when
a subset S of results is retrieved, they will be retrieved in the same relative order
they had when they were retrieved by lem.²
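
The fusion procedure described above can be sketched in Python as follows. It is a minimal illustration under the assumption that the distance-based ranking is a permutation of the lem ranking; the function name `fuse` and the document identifiers are hypothetical.

```python
from typing import List


def fuse(lem_ranking: List[str], dist_ranking: List[str], k: int = 30) -> List[str]:
    """Rerank by intersection of the top-k sets of both rankings.

    Every group keeps the relative order its documents had in `lem_ranking`.
    """
    top_l = set(lem_ranking[:k])
    top_d = set(dist_ranking[:k])
    both = top_l & top_d                # retrieved in the top k by both runs
    only_one = (top_l | top_d) - both   # retrieved in the top k by one run only
    first = [doc for doc in lem_ranking if doc in both]
    second = [doc for doc in lem_ranking if doc in only_one]
    rest = [doc for doc in lem_ranking if doc not in top_l and doc not in top_d]
    return first + second + rest


lem = ["d1", "d2", "d3", "d4", "d5"]
dist = ["d2", "d4", "d1", "d3", "d5"]
fused = fuse(lem, dist, k=2)
```

With k = 2 the only document shared by both top sets is d2, so it moves to the first position, followed by d1 and d4 (each in only one top set, in lem order) and then the remaining documents.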

5 Experimental Results with Data Fusion 

After a previous tuning phase, in which different values of K were tested,³
a value K = 30 was chosen as the best compromise, since although lower values
of K showed peaks of precision in the top documents retrieved, their global
behavior was worse.

Table 3 shows the results obtained with this new approach. Column tri shows
the results obtained by means of the fusion through intersection of the set of
documents initially retrieved with lem with the documents retrieved by applying
reranking through distances using a triangle function. The results corresponding
to the circle function are shown in column cir.

The improvements attained with this new approach - in boldface - are 
general, particularly in the case of the precision at N documents retrieved. 
Moreover, there are no penalizations for non-interpolated precision and R- 
precision. 

² That is, if the original sequence in lem was d2-d3-d1 and a subset {d1, d3} is going
to be returned, the documents should be obtained in the same relative order as in
the original results: d3-d1.

³ $K \in \{5, 10, 15, 20, 30, 50, 75, 100, 200, 500\}$.

Table 3. Reranking through data fusion; K=30

                                 short queries                  long queries
                             stm    lem    tri    cir      stm    lem    tri    cir
Documents                    99k    99k    99k    99k      99k    99k    99k    99k
Relevant (5548 expected)    5086   5207   5207   5207     5208   5234   5234   5234
Non-interpolated precision .5210  .5235  .5204  .5206    .5638  .5648  .5654  .5647
Document precision         .5502  .5814  .5829  .5836    .5925  .6038  .6083  .6094
R-precision                .4952  .4978  .4911  .4911    .5316  .5335  .5311  .5306
Precision at .00 recall    .8426  .8260  .8424  .8428    .9028  .8788  .8871  .8901
Precision at .10 recall    .7294  .7431  .7520  .7522    .7910  .7989  .8052  .8075
Precision at .20 recall    .6746  .6936  .7043  .7059    .7326  .7420  .7501  .7496
Precision at .30 recall    .6135  .6380  .6434  .6447    .6763  .6887  .6975  .6983
Precision at .40 recall    .5812  .5900  .5967  .5965    .6401  .6499  .6577  .6595
Precision at .50 recall    .5470  .5520  .5447  .5454    .5975  .6058  .6092  .6141
Precision at .60 recall    .5078  .5099  .4997  .4999    .5452  .5502  .5443  .5362
Precision at .70 recall    .4518  .4498  .4325  .4282    .4816  .4816  .4729  .4644
Precision at .80 recall    .3882  .3796  .3665  .3653    .4056  .4022  .3929  .3885
Precision at .90 recall    .3044  .2923  .2846  .2857    .3356  .3150  .3045  .3036
Precision at 1.0 recall    .1897  .1756  .1687  .1684    .2054  .1918  .1862  .1857
Precision at 5 docs        .6182  .6182  .6303  .6343    .6808  .6747  .6929  .6949
Precision at 10 docs       .5717  .5758  .5929  .5970    .6182  .6202  .6525  .6495
Precision at 15 docs       .5279  .5380  .5522  .5542    .5670  .5798  .5993  .5980
Precision at 20 docs       .4965  .5071  .5217  .5207    .5338  .5556  .5672  .5646
Precision at 30 docs       .4434  .4582  .4582  .4582    .4822  .5030  .5030  .5030
Precision at 100 docs      .2935  .3016  .3040  .3044    .3119  .3171  .3182  .3193
Precision at 200 docs      .1937  .2002  .2006  .2008    .2053  .2060  .2064  .2064
Precision at 500 docs      .0945  .0981  .0982  .0982    .0981  .0985  .0987  .0987
Precision at 1000 docs     .0514  .0526  .0526  .0526    .0526  .0529  .0529  .0529

6 Conclusions and Future Work 

In this article we have proposed the use of a distance-based retrieval model, also 
called locality-based, which allows us to face the problem of syntactic linguistic 
variation in text conflation employing a pseudo-syntactic approach. 

Two approaches were proposed for this purpose, both based on reranking the 
results obtained by indexing content word lemmas. The first approach, where 
the ranking obtained by means of the application of the locality-based model 
is the final ranking to be retrieved, did not get, in general, good results. After 
analyzing the behavior of the system, a new approach was taken, this time based 
on data fusion, which employs the intersection of the sets of documents retrieved 
by both approaches as reference for the reranking. This second approach was 
fruitful, since it obtained consistent improvements in the ranking at all levels, 
without harming other aspects. 

With respect to future work, several aspects should be studied. Firstly, we 
intend to extend our experiments to other retrieval models apart from the vector 
model, in order to test its generality. Secondly, we aim to improve the system 
by managing not only syntactic variants but also morphosyntactic variants [9] . 

Two new applications of this locality-based approach are also being 
considered. Firstly, in Query Answering, where it will in all probability prove 
most useful, since this distance-based model allows us to identify the relevant 
locations of a document, which probably contain the answer, with respect to the
query. Once the relevant locations are identified, the answer would be extracted
through further in-depth linguistic processing. Secondly, its possible application
in query expansion through local clustering based on distances [2] is also being
studied.

Abstract. In this paper we propose a new data structure for the efficient
extraction of structured motifs from DNA sequences. A structured motif is defined as
a collection of highly conserved motifs with pre-specified sizes and spacings
between them. The new data structure, called box-link, stores the information on
how to jump over the spacings which separate each motif in a structured motif. A
factor tree, a variation of a suffix tree, endowed with box-links provides the means
for the efficient extraction of structured motifs.



Structured motifs try to capture highly conserved complex regions in a set of DNA
sequences which, in the case of sequences from co-regulated genes, model functional
combinations of transcription factor binding sites [1-3]. Formally, a motif is a
non-empty string over an alphabet $\Sigma$ (e.g., $\Sigma = \{A, C, T, G\}$ for DNA sequences). A
structured motif [1] is a pair $(m, d)$ where $m$ is a $p$-tuple of motifs denoting $p$
boxes, and $d$ is a $(p - 1)$-tuple of pairs $(d_{min_i}, d_{max_i})_{1 \le i < p}$, denoting $p - 1$ intervals
of distance. In the following, we consider that all $p$ boxes of a structured motif have a
fixed length $k$ and a fixed distance between boxes $d$. The general case was studied but is
out of the scope of this abstract. Algorithms and complexity results are easily adaptable
to the more general case.

A factor tree, also called a $k$-factor tree [4], is a data structure that indexes the
factors of a string whose length does not exceed $k$. In the following we define
box-links, whose purpose is to store the information needed to jump from box to box in a
structured motif, over a factor tree. Formally, let $L$ be the set of leaves at depth $k$ of a
$k$-factor tree $T$ for a string $s$ of length $n$ and $L^i$ denote all possible $i$-tuples over $L$. A
box-link of size $i$, with $1 \le i \le p$, is a $(i + 1)$-tuple in $L^{i+1}$ such that there is a substring
$s'$ of $s$ where: (i) the length of $s'$ is $ik + (i - 1)d$; (ii) the $k$-length substring of $s'$ ending
at position $jk + (j - 1)d$, with $1 \le j \le i$, is the path in $T$ spelled from the root to the
$j$-th leaf of the box-link tuple. Box-links can be used to extract structured motifs when
built over a generalized factor tree (a factor tree for a set of $N$ sequences). However,
in this case, box-links have to be endowed with a Colors Boolean array [1] in order to
distinguish in which of the $N$ input sequences the corresponding boxes are linked.

In the following, we present an algorithm to build box-links. The algorithm makes
use of two variables. First, the variable $list_{leaf}$ has the list of all leaves inserted in the
factor tree, which can be easily obtained during the factor tree construction. In fact, for
the sake of exposition, $list_{leaf}$ can be seen as a family of variables $(list_{leaf_i})_{1 \le i \le N}$,
where each $list_{leaf_i}$ has average length $n$, the average length of an input sequence.
Observe that the substring labeling the path from the root to the $j$-th leaf of $list_{leaf_i}$
corresponds to the $j$-th at most $k$-length substring of the $i$-th input string. Second, the
variable $b_j$ stores the $j$-size box-links being built. We now describe the AddBoxLink
function. AddBoxLink(b, v, i) adds a box-link between an existing $(j - 1)$-size box-link $b$
and a leaf $v$ for the $i$-th input sequence. However, it only creates a new box-link if
there is not already a box-link between box-link $b$ and node $v$. Either way, whether or
not a new box-link is created, the AddBoxLink function sets the Boolean array entry $i$ to 1. The
pseudo-code of the algorithm to build box-links is presented in Algorithm 1.

A. Apostolico and M. Melucci (Eds.): SPIRE 2004, LNCS 3246, pp. 267-268, 2004.



Algorithm 1 BoxLink(Boxes p, BoxSize k, BoxDistance d, ListLeaf list_leaf)

1. for i from 1 to N
2.   while size of list_leaf_i > pk + (p - 1)d
3.     b_0 = AddBoxLink(nil, list_leaf_i[0], i)
4.     for j from 1 to p - 1
5.       b_j = AddBoxLink(b_{j-1}, list_leaf_i[jk + jd], i)
6.     remove the first leaf of list_leaf_i
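
The idea behind Algorithm 1 can be illustrated with a simplified Python sketch. It is not the factor-tree implementation: leaves are replaced by the k-length substrings they spell, a box-link is represented as a tuple of such substrings, and the Colors Boolean array is replaced by a set of sequence indexes. The function name `build_box_links` and the sample sequences are hypothetical, and every window in which all p boxes fit is processed.

```python
from typing import Dict, List, Set, Tuple

BoxLink = Tuple[str, ...]


def build_box_links(sequences: List[str], p: int, k: int, d: int) -> Dict[BoxLink, Set[int]]:
    """Map each chain of boxes to the set of sequences where it occurs.

    Boxes have length k and consecutive boxes are separated by d symbols.
    """
    links: Dict[BoxLink, Set[int]] = {}
    span = p * k + (p - 1) * d  # length covered by p boxes and their spacers
    for i, s in enumerate(sequences):
        for start in range(len(s) - span + 1):
            chain: BoxLink = ()
            for j in range(p):
                offset = start + j * (k + d)  # jump over one box and one spacer
                chain += (s[offset:offset + k],)
                # Create the link if it is new; in either case mark sequence i.
                links.setdefault(chain, set()).add(i)
    return links


links = build_box_links(["ACGTAC", "ACGTTT"], p=2, k=2, d=1)
```

In this toy input the chain ("AC",) is shared by both sequences, whereas ("AC", "TA") only occurs in the first one, which is the kind of information the Colors array is meant to record.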



Next, we establish the complexity for Algorithm 1. Let $n_l$ be the number of nodes
at depth $l$ of the generalized suffix tree for the same input sequences as the factor tree
where the box-links are being constructed, and $b_p(k, d) = \min\{n_k^p, n_{pk+(p-1)d}\}$.

Proposition 1. Algorithm 1 takes $O(N^2 n p)$ time and $O(N b_p(k, d))$ space.

Proof. Steps 1, 2 and 4 require $O(N)$, $O(n)$ and $O(p)$ time, respectively. Step 5
requires $O(N)$ time, which corresponds to the creation or update of the Colors array. Hence,
Algorithm 1 takes $O(N^2 n p)$ time. The space complexity is given by the number of
box-links, which can be upper bounded by $b_p(k, d)$, times its size, which is $N$. □

The use of box-links achieves a time and space exponential gain, in the worst case
analysis, over approaches in [1]. Time improvement is obtained because the information
required to jump from box to box in a structured motif is memorized and accessed very
rapidly with box-links. Moreover, it is only required to build a $k$-factor tree, instead of
a full suffix tree, or a $(pk + (p - 1)d)$-factor tree, which leads to important space savings.

Abstract. An algorithm to compute the exact degree of balancedness in
the output sequence of a LFSR-based generator (either nonlinear filters
or combination generators) has been developed.

Keywords: Balancedness, bit-string algorithm, computational logics 



1 Introduction 

Generators of binary sequences based on Linear Feedback Shift Registers (LFSRs)
[1] are electronic devices widely used to generate pseudorandom sequences
in many different applications. The pseudorandom sequence is generated as the
image of a nonlinear Boolean function F in the LFSR's stages. Balancedness in
the generated sequence is a necessary condition that every LFSR-based generator
must satisfy. Roughly speaking, a periodic binary sequence is balanced when
the number of 1's and the number of 0's in a period are as equal as possible. Due
to the long period of the sequences produced by LFSR-based generators, it is
not possible to generate the whole sequence and then to count the number of 1's
and 0's. Therefore, in practical design of binary generators, statistical tests are
applied to segments of the output sequence just to obtain probabilistic evidence
that a generator produces a balanced sequence. In the present work, balancedness
of pseudorandom sequences has been treated in a deterministic way. In fact,
an algorithm to compute the exact number of 1's and 0's in the output sequence
of a LFSR-based generator has been developed. To our knowledge, this is the
first algorithm to perform this task. The algorithm input is the particular form
of the generating function F while the algorithm output is the number of 1's
in the generated sequence (as the period is known so is the number of 0's). In
this way, the degree of balancedness of the output sequence can be perfectly
checked. The algorithm, which is based on an L-bit string representation, has been
mainly applied to nonlinear filters (high-order functions F with a large number
of terms); its generalization to combination generators (low-order functions F
with a small number of terms) is just a simplification of the process.

* Work supported by Ministerio de Ciencia y Tecnología (Spain) under grant TIC
2001-0586.

2 Efficient Computation of Balancedness

Let $F$ be a Boolean function in Algebraic Normal Form whose input variables
$a_i$ ($i = 0, \ldots, L - 1$) are the binary contents of the $L$ stages of a LFSR. The
function $\Phi(F)$ is defined as a Boolean function that substitutes each term $a_i a_j \ldots a_m$
of $F$ for its corresponding minterm $A_{ij \ldots m}$. Each term of $F$ is the logic product
of LFSR stages $a_i a_j \ldots a_m = a_{ij \ldots m} = a_\alpha$, and each term of $\Phi$ is
notated $A_{ij \ldots m} = A_\alpha$. Thus, the nonlinear functions $F$ and $\Phi$ can be written as
$F = \bigoplus_k a_{\alpha_k}$ and $\Phi = \bigoplus_k A_{\alpha_k}$, respectively, where the symbol $\oplus$ represents the
exclusive-OR sum. In addition, the minterm $A_{\alpha_k}$ has in total $2^{L - d(\alpha_k)}$ terms,
$d(\alpha_k)$ being the number of indexes in $\alpha_k$. In order to implement this algorithm,
every minterm $A_\alpha$ is represented by an $L$-bit string numbered $0, 1, \ldots, L - 1$ from
right to left. If the $n$-th index is in the set $\alpha$ ($n \in \alpha$), then the $n$-th bit of such
a string takes the value 1; otherwise, the value will be 0. Thus, $d(\alpha)$ equals the
number of 1's in the $L$-bit string that represents $A_\alpha$. We call maximum common
development of two minterms $A_\alpha$ and $A_\beta$, notated $MD(A_\alpha, A_\beta)$, the minterm
$A_\gamma$ such that $\gamma = \alpha \cup \beta$. Under the $L$-bit string representation of the minterms,
MD can be realized by means of a bit-wise OR operation between the binary
strings of both functions. MD represents all the terms that $A_\alpha$ and $A_\beta$ have in
common.

Let $F = \bigoplus_{k=1}^{N} a_{\alpha_k}$ be a nonlinear Boolean function of N terms
applied to an L-stage LFSR. In order to compute the number of 1's (notated $U_F$)
in the generated sequence, the following algorithm is introduced:

- Step 1: Define the function $\Phi_F$ from the N terms $a_{\alpha_k}$ of F. Initialize the
function H with a null value, $H_0 = 0$.

- Step 2: For $k = 1 \ldots N$: $H_k = H_{k-1} + A_{\alpha_k} - 2 \cdot MD(A_{\alpha_k}, H_{k-1})$.

- Step 3: From the final form of $H_N = \sum_j s_j A_{\beta_j}$, compute the number of 1's
in the generated sequence by means of the expression $U_F = \sum_j s_j \cdot 2^{L-d(\beta_j)}$.
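The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each L-bit string is stored as an integer mask, H is kept as a map from masks to integer coefficients $s_j$, MD is the bit-wise OR, and the count is taken over all $2^L$ register contents (for a function F without a constant term the all-zero content contributes nothing). The names `count_ones` and `brute_force` are hypothetical.

```python
def count_ones(terms, L):
    """Count the L-bit register contents for which the XOR of the terms equals 1.

    Each term is an integer mask: bit n is set when stage a_n appears in the term.
    """
    H = {}  # mask beta_j -> integer coefficient s_j (H_0 = 0 is the empty map)
    for alpha in terms:
        update = {alpha: 1}  # the "+ A_alpha" part of Step 2
        for beta, s in H.items():
            gamma = alpha | beta  # MD realised as a bit-wise OR
            update[gamma] = update.get(gamma, 0) - 2 * s  # "- 2 * MD(A_alpha, H_{k-1})"
        for mask, s in update.items():
            H[mask] = H.get(mask, 0) + s
            if H[mask] == 0:
                del H[mask]  # drop cancelled terms
    # Step 3: a string with d ones accounts for 2^(L - d) register contents
    return sum(s * 2 ** (L - bin(mask).count("1")) for mask, s in H.items())


def brute_force(terms, L):
    """Reference count obtained by evaluating the function on every register content."""
    total = 0
    for state in range(2 ** L):
        value = 0
        for alpha in terms:
            value ^= int((state & alpha) == alpha)  # product term is 1 iff all its stages are 1
        total += value
    return total


if __name__ == "__main__":
    # F = a0 a1 XOR a2 XOR a1 a2 a3 on a 4-stage register (illustrative values)
    terms = [0b0011, 0b0100, 0b1110]
    print(count_ones(terms, 4), brute_force(terms, 4))
```

The brute-force routine is included only to cross-check the sketch on small L; it is exactly the exhaustive enumeration that the algorithm is meant to avoid for long registers.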

The calculations were performed on a simple PC (Intel Xeon 2.8 GHz CPU,
1 GB of RAM) running Linux. More than 40
different nonlinear functions F, each of them including 50 terms generated at
random, were applied to an LFSR of L = 32 stages. Numerical results show that
high-performance computers are not needed in order to run the algorithm. In fact,
the worst execution time obtained for any of the tested functions was less than
11 hours. Based on these implementations, the algorithm is believed to be a
useful tool to calculate the exact degree of balancedness in sequences produced
by LFSR-based generators.
Abstract. A major issue when defining the efficiency of a spelling
corrector is how far we need to examine the input string to validate the 
repairs. We claim that regional techniques provide a performance and 
quality comparable to that attained by global criteria, with a significant 
saving in time and space. 



1 Introduction 

Although a lot of effort has gone into the problem of spelling correction over 
the years, it remains a research challenge. In particular, we are talking about 
a critical task in natural language processing applications for which efficiency, 
safety and maintenance are properties that cannot be neglected. 

Most correctors assist users by offering a set of candidate repairs. Any
technique that reduces the number of candidates for correction therefore improves
efficiency, and it should do so without side effects on safety. To this
end, we focus on limiting the size of the repair region [2], in contrast to previous
global proposals [1]. Our goal now is to evaluate our proposal, examining the
error context to later validate repairs by tentatively recognizing ahead, avoiding
cascaded errors and corroborating previous theoretical results.

2 Asymptotic Behavior 

We introduce some preliminary tests illustrating that our proposal provides
a quality similar to that of global approaches with a significant reduction in
cost, which matches that of global approaches only in the worst case.
To do so, we choose to work with Spanish, a language with a highly complex
conjugation paradigm and gender and number inflection. The lexicon has 514,781
words, recognized by a finite automaton (FA) containing 58,170 states connected
by 153,599 transitions, from which we have selected a representative sample

* Research partially supported by the Spanish Government under projects TIC2000- 
0370-C02-01 and HP2002-0081, and the Autonomous Government of Galicia under 
projects PGIDIT03SIN30501PR and PGIDIT02SIN01E. 
that follows the length distribution of the words in the lexicon. For each length
category, a random number of errors has been generated in random positions.

We compare our proposal with Savary's global approach [1], to the best of
our knowledge the most efficient method of error-tolerant look-up in finite-state
dictionaries. We take the set of calculations associated with a transition in the
FA, which we call an item, as the unit for measuring computational effort. Finally,
the precision reflects how often the correction expected by the user is provided.
Fig. 1. Number of items generated in error mode.



Some preliminary results are compiled in Fig. 1. The graph illustrates
our contribution from two viewpoints. First, our proposal shows a linear-like
behavior, in contrast to Savary's approach, which seems to be of exponential
type. This yields an essential property: the response time is independent
of the initial conditions for the repair process. Second, the number of items is
significantly reduced when we apply our regional criterion. These tests provided
a precision of 77% (resp. 81%) for the regional (resp. global) approach. The
integration of linguistic information should reduce this gap, less than 4%, or
even eliminate it. In effect, our regional approach currently takes into account only
morphological information, which has an impact on precision, while a global
technique always provides all the repair alternatives without exclusion.
Abstract. In this paper, we present two new algorithms for discovering
monad patterns in DNA sequences. Monad patterns are of the form (l,d)-k,
where l is the length of the pattern, d is the maximum number of
mismatches allowed, and k is the minimum number of times the pattern
is repeated in the given sample. The time-complexity of some of the best
known algorithms to date is $O(n t^2 l^d \sigma^d)$, where t is the number of input
sequences, n is the length of each input sequence, and $\sigma = |\Sigma|$ is the size
of the alphabet. The first algorithm that we present in this paper takes
0{'n?t^l^) time and 0{nW^ a~^) space, and the second algorithm takes 
) time using ) space. In practice, our algorithms have 

much better performance provided the d/l ratio is small. The second
algorithm performs very well even for large values of l and d as long as the
d/l ratio is small.

1 Introduction 

Discovering regulatory patterns in DNA sequences is a well known problem in
computational biology. Due to mutations and other errors, the actual occurrences
of these regulatory patterns allow for a certain degree of error. Therefore, the
actual regulatory pattern (or the consensus pattern) may never appear in a gene
upstream region, but d-mismatch occurrences of this pattern might appear. The
general approach to this problem is to take a set of t DNA sequences each of
length n, at least k of which are guaranteed to contain the desired binding site,
and look for patterns of a certain length l that occur in at least k out of the t
sequences with at most d mismatches at each occurrence. The values of l, d and
k can be determined either from prior knowledge about the binding site, or by
trial and error, trying different values of l and d. These single contiguous blocks
of patterns are called monad patterns.

In general, many regulatory signals are made up of a group of monad patterns
occurring within a certain distance from each other [EskKGP03, EskP02,
GuhS01, vanRG00]. In such a case, the patterns are called dyad, triad, multi-ad,
or in general composite patterns. Finding the composite patterns by finding
the component monad patterns individually is significantly more difficult, since

* This research was partially supported by NSF grant number: ITR-0312724. 
Fig. 1. A pattern P that is consistent with an l-gram pair (Li, Lj)



the composite monad patterns might be too subtle to detect. Eskin and Pevzner 
[EskP02] present a simple transformation to convert a multi-ad problem into a 
slightly larger monad problem. In this paper, we present an algorithm to solve 
the monad-pattern finding problem. The same transformation as in [EskP02] 
can be applied to transform a multi-ad problem into a monad problem that is 
handled by our algorithm. 

Pevzner and Sze [PevS00] have put forward a challenge problem: to find the
signal in a sample of t = 20 sequences, each 600 nucleotides long, each containing
an unknown pattern of length l = 15 with at most d = 4 mismatches. They
presented the WINNOWER and SP-STAR algorithms that could solve this problem,
which was not solvable by many of the earlier techniques. Many other approaches
that can solve this problem have been proposed [Sag98, EskP02, Lia03,
BuhT2001]. The time-complexity of the best known algorithms [Sag98, EskP02] is
$O(n t^2 l^d \sigma^d)$.

Many of the above algorithms search the d-mismatch neighborhood of each
l-gram in the sample. The size of the d-mismatch neighborhood of an l-gram
is $O(l^d \sigma^d)$. The main motivation for our algorithms is that in most practical
scenarios, it might be possible to limit the search to a small portion of the
d-mismatch neighborhood. We refer to the set of patterns that mismatch in at
most d positions with two l-grams as the consistent patterns of the two l-grams.
We denote the distance (the number of mismatches) between two l-grams Li and
Lj by D(Li, Lj). The distance relationships between two l-grams Li and Lj and
a pattern P that is consistent with both of them are shown in Figure 1. The
following observations form the basis for our algorithm:

Observation 1: For each l-gram, it is sufficient to search the consistent patterns
of the l-gram with respect to all other l-grams.

Observation 2: The number of other l-grams in the sample that are within h
mismatches from the current l-gram reduces rapidly with decreasing h. This is
illustrated in Figure 2-(a) for a random sample of 20 sequences of 600 nucleotides
each. The size of the average 2d-mismatch neighborhood is 571.395, whereas the
average size of the d-mismatch neighborhood is just 1.23.

Observation 3: The number of consistent patterns between two l-grams which
mismatch in h positions decreases rapidly with increasing h. When h is greater
Fig. 2. Variation of: (a) h-mismatch neighborhood (b) consistent patterns with h



than 2d, this number is zero, as two l-grams that mismatch in more than 2d
positions cannot have any patterns that mismatch with both of them in at most
d positions. Therefore, as is illustrated in Figure 2-(b), the number of consistent
patterns between two l-grams which mismatch in more than d positions is quite
small.

2 Previous Approaches to Pattern Discovery 

The pattern discovery problem can be formally stated as follows: Given a set of
DNA sequences (also referred to as the sample) S = {S1, S2, ..., St}, and a set
of parameters l, d and k, the problem is to find all length-l patterns that occur
with up to d mismatches in at least k different sequences in the sample.

One of the earliest techniques to solve this problem, as presented in [PevS00],
is known as the pattern driven approach (PDA). The pattern driven approach searches
all of the pattern space: it enumerates each possible pattern and checks if it
meets the search criteria. If the pattern length is l, there are $4^l$ possible patterns,
assuming a DNA alphabet. Pattern driven approaches take each one of these
patterns and compare it with all the l-grams in the sample. This approach
takes exponential time in terms of l, and the problem quickly becomes practically
unsolvable even for moderate values of l.
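As a rough illustration of the pattern driven approach, the sketch below enumerates every candidate pattern and tests it against every sequence. It is a minimal sketch, not code from any of the cited papers; the function names and the tiny sample are made up for the example.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pattern_driven(sample, l, d, k, alphabet="ACGT"):
    """Return every length-l pattern that occurs with <= d mismatches in >= k sequences."""
    found = []
    for letters in product(alphabet, repeat=l):  # all |alphabet|^l candidate patterns
        pattern = "".join(letters)
        hits = sum(
            any(hamming(pattern, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))
            for seq in sample
        )
        if hits >= k:
            found.append(pattern)
    return found


if __name__ == "__main__":
    sample = ["ACGTAC", "TTACGA", "GGGACG"]  # illustrative data only
    print(pattern_driven(sample, l=3, d=0, k=3))
```

The outer loop over all $4^l$ patterns is what makes this approach impractical for the challenge problem with l = 15.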

A faster approach, termed by [EskP02] the Sample Driven Approach
(SDA), searches a reduced search space of only the l-grams that occur in the
sample and their d-mismatch neighbors. The SDA algorithm trades space for
time: it maintains a table of size $4^l$, each entry in the table corresponding to
a pattern. For each l-gram in the input sample, the algorithm enumerates all
the patterns that make up its d-mismatch neighborhood. For each pattern in
the neighborhood, the corresponding entry in the table is incremented. After all
the l-grams have been processed, the patterns in the table that have a score
of at least k are reported. The problem with the SDA approach is that the
memory requirements are huge, and increase exponentially with l. Therefore the
SDA approach, like the pattern driven approach, quickly becomes unmanageable, even for
moderate values of l.
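The sample driven idea can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of the description above: a dictionary stands in for the full table of $4^l$ entries, and each entry stores the set of sequence indices that reach the pattern, so that repeated hits from one sequence are not counted twice. The function names are hypothetical.

```python
from itertools import combinations, product


def neighborhood(lgram, d, alphabet="ACGT"):
    """All strings that differ from lgram in at most d positions."""
    result = set()
    for r in range(d + 1):
        for positions in combinations(range(len(lgram)), r):
            # at each chosen position, substitute any symbol except the original one
            choices = [[c for c in alphabet if c != lgram[p]] for p in positions]
            for subs in product(*choices):
                chars = list(lgram)
                for p, c in zip(positions, subs):
                    chars[p] = c
                result.add("".join(chars))
    return result


def sample_driven(sample, l, d, k):
    """Patterns whose d-mismatch neighborhood is hit by l-grams of at least k sequences."""
    table = {}  # pattern -> set of indices of sequences that reach it
    for idx, seq in enumerate(sample):
        for i in range(len(seq) - l + 1):
            for pattern in neighborhood(seq[i:i + l], d):
                table.setdefault(pattern, set()).add(idx)
    return sorted(p for p, seqs in table.items() if len(seqs) >= k)


if __name__ == "__main__":
    sample = ["ACGTAC", "TTACGA", "GGGACG"]  # illustrative data only
    print(sample_driven(sample, l=3, d=0, k=3))
```

Even with a dictionary, the number of stored patterns grows with the neighborhood size, which is the memory problem the text describes.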

The WINNOWER algorithm [PevS00] and the cWINNOWER algorithm
[Lia03] are based on graph theory. In these algorithms, a graph is constructed
in which each vertex is an l-gram in the input sequence. Two l-grams are
connected by an edge if they mismatch in at most 2d positions. Now, the problem is
mapped to the problem of finding k-cliques in the graph. The problem of finding
k-cliques in a graph, when k > 3, is known to be NP-complete. Therefore,
WINNOWER and cWINNOWER apply heuristics to arrive at a solution.
In the first step, all the nodes that have a degree less than k-1 are removed. After
that, different techniques are applied to try to remove the spurious edges in the
graph that cannot be part of a solution. The complexity of WINNOWER and
cWINNOWER for the most sensitive versions of the algorithms are given by 
and O(Z^n^), respectively. However, it is important to note that even 
though it is claimed that the most sensitive versions of these algorithms solve 
almost all practical problems, they are not guaranteed to solve a given problem. 

Some of the other approaches include suffix tree-based approaches [Sag98,
PavMP01]. The SPELLER algorithm presented in [Sag98] first builds a suffix
tree for the input sequence. It then examines all possible patterns by traversing
the suffix tree. If the paths to k different leaves of length l mismatch
with the pattern in at most d positions, then the pattern is reported. Starting
with zero characters at the root, the pattern is extended one character at a time.
At any time, if there are fewer than k different paths in the suffix tree that mismatch
in at most d positions with the current pattern, the search is stopped and the
(alphabetically) next pattern of the same length, or the next pattern of a shorter
length, is searched. The complexity of the algorithm is given as $O(n t^2 l^d 4^d)$.

In the sequence driven approach, each l-gram is searched separately. The
Mitra-Count algorithm [EskP02] is based on the idea that if all the l-grams are
searched concurrently, then only the information about those l-grams that meet
the current search criteria needs to be stored. This reduces the memory
requirements drastically. The MITRA algorithm searches the pattern search space
in a depth-first manner, abandoning the search whenever the search criterion
is no longer met. For this it uses the mismatch tree data structure. The path
from the root to a node at depth m in the mismatch tree represents a prefix of
the pattern of length m. The list of l-grams from the sample whose m-length
prefixes mismatch in at most d positions with the path label of the current node
is stored at the node. The tree is built in a depth-first fashion. Whenever the
size of the list of l-grams at a node falls below k, the node is discarded, and the
subtree of the node is never searched. Whenever the search reaches depth l,
the pattern corresponding to the path label is reported. The algorithm is memory
efficient, since only the nodes that lie on the current path need to be stored
at any time. An improved algorithm, Mitra-Graph, also presented in [EskP02],
applies WINNOWER-like pairwise similarity information in order to maintain
a graph at each node of the mismatch tree. If two l-grams L1 and L2 mismatch in
d1 and d2 positions respectively with the node label, and if their suffixes beyond
the current depth mismatch in q positions, then the two l-grams are connected
by an edge if d1 + d2 + q ≤ 2d. The nodes can be discarded if there is no
possibility for a k-clique in the graph. Even though there is an extra overhead of
maintaining the graph and extending the graph at each node, a much smaller
pattern sub-space needs to be searched in Mitra-Graph. The theoretical complexity
of Mitra is claimed to be the same as that of the SPELLER algorithm.
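The depth-first search over the mismatch tree can be sketched as a short recursion. This is a simplified illustration of the Mitra-Count idea as described above, not the implementation from [EskP02]; in particular, the sketch prunes on the number of distinct sequences still represented at a node (an assumption chosen to match the problem statement), and the function name is hypothetical.

```python
def mitra_count(sample, l, d, k, alphabet="ACGT"):
    """Depth-first search of the pattern space with a mismatch list at each node."""
    lgrams = [(idx, seq[i:i + l])
              for idx, seq in enumerate(sample)
              for i in range(len(seq) - l + 1)]
    found = []

    def extend(prefix, alive):
        # alive holds (sequence index, l-gram, mismatches against the current prefix)
        if len({idx for idx, _, _ in alive}) < k:
            return  # too few sequences left: discard the node and its whole subtree
        depth = len(prefix)
        if depth == l:
            found.append(prefix)  # the path label is a valid pattern
            return
        for c in alphabet:
            child = []
            for idx, gram, mism in alive:
                new_mism = mism + (gram[depth] != c)
                if new_mism <= d:
                    child.append((idx, gram, new_mism))
            extend(prefix + c, child)

    extend("", [(idx, gram, 0) for idx, gram in lgrams])
    return found


if __name__ == "__main__":
    sample = ["ACGTAC", "TTACGA", "GGGACG"]  # illustrative data only
    print(mitra_count(sample, l=3, d=0, k=3))
```

Only the lists along the current root-to-node path are alive at any time, which is the source of the memory efficiency mentioned in the text.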



3 The PRUNER Algorithm 

3.1 Our Contributions 

Our approach is based on the WINNOWER algorithm [PevS00, Lia03]. As in
WINNOWER, we build a graph based on pairwise similarity information, and
prune the graph, eliminating vertices that cannot be part of a solution.
However, after this point, we employ a different approach. The algorithms try to
successively remove edges from the graph, after checking all the patterns that
mismatch in at most d positions from both the l-grams that are connected by
the edge. We categorize the edges into two groups. Group1 consists of edges
that connect l-grams that differ in more than d positions, and Group2 consists
of edges that connect l-grams that differ in d or fewer positions. In
the following sections, we will show that there are relatively few patterns
that mismatch in at most d positions from both the l-grams that are connected
by a Group1-edge. Precisely, we will show that there are at most $O(l^{d/2} 4^{d/2})$
such patterns for every Group1-edge. Each Group2-edge, on the other hand,
can have $O(l^d 4^d)$ such patterns. We present a technique which enumerates all
the patterns corresponding to each Group1-edge, checks each one of them to see
if it satisfies the search criteria, and removes the Group1-edge. We show that
at least k monad patterns can be reported without enumerating the patterns
corresponding to the Group2-edges, thereby avoiding the $O(l^d 4^d)$ complexity.

Unlike WINNOWER and cWINNOWER[Lia03], our algorithm is guaranteed to 
find a solution in a^) time using 0{ntl^ a^) space. 

3.2 Problem Statement 

In the discussion that follows, for convenience in illustration, we treat the input
sample as a single sequence of size n. The time and space complexities are not
affected by this simplification. In Section 3.5, we explain the enhancements to
handle t different sequences, instead of a single sequence. Therefore, the problem
can be stated as follows: given a string S of length n over the alphabet Σ =
{A, C, G, T}, the problem is to find a pattern P of length l that occurs at least
k times in S with at most d mismatches in each occurrence.

3.3 Terms and Definitions 

We denote a length-l substring (an l-gram) of S starting at position i in S by
Li. A score h = D(Li, Lj) indicates the number of positions in which the two
l-grams Li, Lj mismatch. We denote the set of patterns that mismatch with
both Li and Lj in at most d positions by p(Li, Lj). We also refer to the set p(Li, Lj)
as the set of patterns that are consistent with Li and Lj. We now describe
Fig. 3. (a) The matching (M) and mismatching (H) regions of Li, Lj. (b) Different
regions of the pattern P. The regions in black are the regions in which Li, Lj mismatch
with P



briefly how to compute the size of the set p(Li, Lj). Let P be any pattern such
that P ∈ p(Li, Lj). It is important to note that p(Li, Lj) = ∅ if h > 2d.
We have to enumerate all the different possibilities for P. Let us divide
each l-gram into two regions: the M-region, consisting of positions in which Li and
Lj match each other, and the H-region, consisting of positions in which Li
and Lj mismatch each other, as shown in Figure 3-(a). Both regions
are shown as contiguous for simplicity of illustration. In reality, these regions
need not be contiguous. Now, let us assume that Li and Lj mismatch with
P in $d_c$ positions within the M-region. Additionally, within the H-region, let Li
mismatch with P in $h_1$ positions, and let Lj mismatch with P in $h_2$ positions,
as shown in Figure 3-(b). Again, none of these regions needs to be contiguous.

Now, the $d_c$ mismatch positions can be chosen from the $l - h$ positions of the
M-region in $\binom{l-h}{d_c}$ ways. At each one of these positions, we have
$\sigma - 1 = 3$ symbols to choose from.
Similarly, the $h_1$ positions in which Li can mismatch with P can be selected from h
positions in $\binom{h}{h_1}$ ways. The remaining $h - h_1$ positions in Li have to match with
P, and hence they mismatch with P in Lj (since we know that Li mismatches
with Lj in these positions). The remaining $(h_1 + h_2 - h)$ positions in which Lj
mismatches with P can be selected from $h_1$ positions in $\binom{h_1}{h_1 + h_2 - h}$ ways.
We have $\sigma - 2 = 2$ options at each one of these $(h_1 + h_2 - h)$ positions, since
P mismatches with both Li and Lj. Therefore, the total number of patterns in
p(Li, Lj) is given by:







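$$
|p(L_i, L_j)| =
\begin{cases}
\displaystyle \sum_{d_c = 0}^{\lfloor (2d-h)/2 \rfloor} \binom{l-h}{d_c} 3^{d_c}
\sum_{h_1 = h - d + d_c}^{d - d_c} \; \sum_{h_2 = h - h_1}^{d - d_c}
\binom{h}{h_1} \binom{h_1}{h_1 + h_2 - h} 2^{h_1 + h_2 - h}, & \text{if } h \le 2d \\
0, & \text{otherwise}
\end{cases}
\qquad (1)
$$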
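The counting argument can be checked numerically on small cases. The sketch below evaluates the sum built from the three factors derived above ($\binom{l-h}{d_c} 3^{d_c}$, $\binom{h}{h_1}$ and $\binom{h_1}{h_1+h_2-h} 2^{h_1+h_2-h}$) and compares it with exhaustive enumeration. The summation limits are written as they follow from the constraints $d_c + h_1 \le d$, $d_c + h_2 \le d$ and $h_1 + h_2 \ge h$, clamped to valid index ranges; this clamping and the function names are assumptions of the sketch.

```python
from itertools import product
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def consistent_count(l, h, d, sigma=4):
    """Number of patterns within d mismatches of two l-grams that differ in h positions."""
    if h > 2 * d:
        return 0
    total = 0
    for dc in range((2 * d - h) // 2 + 1):  # mismatches placed inside the M-region
        outer = comb(l - h, dc) * (sigma - 1) ** dc
        inner = 0
        for h1 in range(max(0, h - d + dc), min(h, d - dc) + 1):
            for h2 in range(h - h1, min(h, d - dc) + 1):
                both = h1 + h2 - h  # H-region positions where P differs from both l-grams
                inner += comb(h, h1) * comb(h1, both) * (sigma - 2) ** both
        total += outer * inner
    return total


def brute_force_count(a, b, d, alphabet="ACGT"):
    """Reference value obtained by enumerating all |alphabet|^l patterns."""
    return sum(1 for p in product(alphabet, repeat=len(a))
               if hamming(p, a) <= d and hamming(p, b) <= d)


if __name__ == "__main__":
    l, d = 5, 2
    for h in range(l + 1):
        a, b = "A" * l, "C" * h + "A" * (l - h)  # two l-grams that differ in h positions
        print(h, consistent_count(l, h, d), brute_force_count(a, b, d))
```

The printed counts shrink quickly as h grows past d and reach zero for h > 2d, which is the behavior summarized in Observation 3.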
In the above expression, |p(Li, Lj)| increases when h decreases. When d <
h ≤ 2d, the maximum value of |p(Li, Lj)| occurs when h = d+1. When h = d+1,
the maximum value that $d_c$ can take is given by $d_c = \lfloor (d-1)/2 \rfloor$, which is equal to
$(d-1)/2$ when d is odd, and $d/2 - 1$ when d is even. Now,




S'^ds in 



Therefore, on the whole, |p(Li, Lj)| is in $O(l^{d/2} 4^{d/2})$.



3.4 The PRUNER-I and PRUNER-II Algorithms 

In both algorithms, we construct a graph G(L, E) where each vertex is an
l-gram in the input sample, and there is an edge (Li, Lj, D(Li, Lj)) connecting two
l-grams Li and Lj if D(Li, Lj) is less than or equal to 2d. We then successively
remove from the graph G(L, E) the vertices representing l-grams that have a degree
less than k - 1, and remove the edges that are incident on these vertices. Until
this point, our algorithms are no different from WINNOWER. However, they
differ from WINNOWER in the following steps.
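The graph construction and degree-based pruning shared with WINNOWER can be sketched as below. This is a minimal illustration for the single-sequence setting of Section 3.2; the adjacency representation (a dictionary of dictionaries storing D(Li, Lj) on each edge) and the function names are assumptions of the sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(lgrams, d):
    """adj[i][j] = D(Li, Lj) for every pair of l-grams with D(Li, Lj) <= 2d."""
    adj = {i: {} for i in range(len(lgrams))}
    for i in range(len(lgrams)):
        for j in range(i + 1, len(lgrams)):
            dist = hamming(lgrams[i], lgrams[j])
            if dist <= 2 * d:
                adj[i][j] = dist
                adj[j][i] = dist
    return adj


def prune_graph(adj, k):
    """Successively remove vertices of degree < k - 1 together with their edges."""
    queue = [v for v in adj if len(adj[v]) < k - 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue  # already removed earlier
        for u in adj.pop(v):
            del adj[u][v]
            if len(adj[u]) < k - 1:
                queue.append(u)  # removal may push a neighbor below the threshold
    return adj


if __name__ == "__main__":
    grams = ["AAAA", "AAAT", "AATT", "CCCC"]  # illustrative l-grams
    graph = prune_graph(build_graph(grams, d=1), k=3)
    print(sorted(graph))
```

Removing one vertex can lower the degree of its neighbors below k - 1, so the pruning is repeated until no such vertex remains.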

Both the PRUNER-I and the PRUNER-II algorithms process each vertex
successively. The PRUNER-I algorithm enumerates the consistent patterns for
every Group1-edge (i.e., edges between l-grams which mismatch in more than
d positions). It then computes how many times each pattern repeats. It does
this by adding all the consistent patterns for each edge to a list, then sorting and
scanning the list. Each time a pattern appears, it means that the pattern is
within d mismatches from another l-gram. Hence, if a pattern repeats k - 1
times, it means that the pattern is within d mismatches from k - 1 other
l-grams. However, since we have not yet processed the Group2-edges (i.e., edges
connecting l-grams that mismatch in d or fewer positions), we cannot yet discard
the patterns that repeat fewer than k - 1 times. We do not want to evaluate all the
consistent patterns for the Group2-edges, as there are too many ($O(l^d 4^d)$) such
patterns. Therefore, we will have to take each pattern in the list, and compare
it with each l-gram that is connected to the current vertex through a
Group2-edge. Only then will we know how many times each one of those patterns has
repeated. An efficient way of doing all this is presented below.

At each node Li, we enumerate the consistent patterns p(Li, Lj) for all the
Group1-edges, i.e., edges (Li, Lj, D(Li, Lj)) such that d < D(Li, Lj) ≤ 2d. We
add these patterns to a list η(i), and remove the edge (Li, Lj, D(Li, Lj)). Lemma
1 states that we can safely remove the edge (Li, Lj, D(Li, Lj)) after enumerating
and adding p(Li, Lj) to η(i).

Lemma 1. After a vertex Li in (Li, Lj, D(Li, Lj)) is processed, there can be no
new patterns in p(Li, Lj) that were not reported while processing Li, but will be
reported while processing the vertex Lj.

Proof. Let us assume that there is a pattern P ∈ p(Li, Lj) that was not reported
while processing node Li, but will be reported while processing node Lj. This
means that there is a set of l-grams ψ(P) other than Lj, such that for each Lq ∈
ψ(P), there is an edge (Lq, Lj, D(Lq, Lj)) connecting Lq and Lj, and D(Lq, P) ≤
d. Additionally, since P will be reported while processing Lj, |ψ(P)| ≥ k - 2. Now,
since for each Lq ∈ ψ(P), D(Lq, P) ≤ d and D(Li, P) ≤ d (as P ∈ p(Li, Lj) by
definition), it follows that D(Li, Lq) ≤ 2d. Therefore, for each Lq ∈ ψ(P) there
is an edge (Li, Lq, D(Li, Lq)) connecting Li and Lq. Since |ψ(P)| ≥ k - 2, and
P ∈ p(Li, Lj), there are at least k - 1 edges incident on Li which contain P as
one of their consistent patterns. Therefore, pattern P must have been reported
while processing node Li. Hence there can be no pattern P ∈ p(Li, Lj) that is
not reported while processing Li that can be reported while processing Lj. □

Now, we need to find out how many times each pattern is repeated in η(i).
An easy way of doing this is to sort η(i) and scan it. As each pattern in
η(i) is a length-l string over a fixed alphabet, η(i) can be sorted in linear time using
radix sort. Let a pattern P repeat m times in η(i). Let R be the degree of node
Li after processing and removing all Group1-edges. As explained in Section 1, R
is expected to be very small. We do the following:

• If m + R < k - 1, we discard P. The number of times P repeats can increase
by at most R, by comparing P with each one of the Group2-edges. If m <
k - 1 - R, there is no way that P can repeat k - 1 times. So we can discard P.

• If m ≥ k - 1, report P, since it is clear that P has already occurred at least
k - 1 times.

• If k - 1 - R ≤ m < k - 1, we compare P with all l-grams that are
still connected to Li. For each such l-gram that mismatches P in at most d
locations, we increment the repeat count of P. If the repeat count reaches
k - 1, we report P. Otherwise, we discard P.

Before we leave Li and proceed to process the next vertex, we can do one more
thing: we can remove the vertex Li from the graph if R < k - 1, without ever
enumerating the consistent patterns for these edges. Lemma 2 proves this.

Lemma 2. If the residual degree R of vertex Li is less than k - 1 after processing
and removing all Group1-edges of Li, there can be no new patterns that will be
reported by processing the Group2-edges.

Proof. Let us assume that there is a pattern P that was not reported while
processing the Group1-edges, but will be reported while processing the
Group2-edges. Since we will be reporting P, and since R < k - 1, there should have been
at least one Group1-edge (Li, Lq, D(Li, Lq)) such that P ∈ p(Li, Lq). Therefore,
P was checked and reported while processing vertex Lq. Hence there can be no
new patterns that will be reported by processing the Group2-edges. □

We are now left with a graph in which the score of each edge is at most d,
and the degree of each remaining vertex is at least k - 1. Therefore, if the graph has
any vertices left, there will be at least k vertices left in each connected component
of the graph. In practice, we do not expect any vertices to remain at this stage,
as our assumption is that there are not too many patterns that meet the search
criteria. All the l-grams that do remain until this stage are themselves valid




New Algorithms for Finding Monad Patterns in DNA Sequences 281 



ProcessLGram()

Inputs: G(L,E), i, l, d, k

Output: Reports patterns in the d-mismatch neighborhood of Li that satisfy the search criteria

1. PatternList ← ∅
2. for every j such that (Li, Lj, D(Li, Lj)) ∈ E do
3.     if D(Li, Lj) > d    /* checking if (Li, Lj, D(Li, Lj)) is a Group-1 edge */
4.         PatternList ← PatternList ∪ p(Li, Lj)
           /* the set p(Li, Lj) of consistent patterns is enumerated by a subroutine at this point */
5.         E ← E - (Li, Lj, D(Li, Lj))    /* the Group-1 edge is immediately removed */
6.     end if
7. end for
8. RadixSort(PatternList)
9. Cnt ← 0    /* Cnt is the number of times the current pattern has repeated */
10. for j ← 1 to |PatternList| - 1
11.     if PatternList_j = PatternList_{j-1}
12.         Cnt ← Cnt + 1
13.     else if Cnt ≥ k - 1 - Degree(Li)    /* Degree(Li) is the residual degree (the degree of
            Group-2 edges of Li), since all the Group-1 edges have been removed in step 5 */
14.         for every r such that (Li, Lr, D(Li, Lr)) ∈ E do    /* for each Group-2 edge */
15.             if D(PatternList_j, Lr) ≤ d
                /* check if the pattern PatternList_j is in the d-mismatch neighborhood of Lr */
16.                 Cnt ← Cnt + 1
17.             end if
18.         end for
19.         if Cnt ≥ k - 1
20.             Report(PatternList_j)    /* PatternList_j is an (l,d)-k pattern */
21.         end if
22.         Cnt ← 0
23.     end if
24. end for

Fig. 4. The routine that checks the d-mismatch neighborhood of each l-gram



solutions, since they mismatch in at most d positions with at least k - 1 other
l-grams. Hence, we report all the remaining l-grams. Beyond this, there might
be other patterns in the graph that meet the search criteria, but in the general
case, we assume that there are fewer than k distinct monad patterns in the
given sample. In the almost impractical scenario that there are more than k
distinct monad patterns, the algorithms we present report at least k of them.
The PRUNER-I algorithm is presented in detail in Figures 4 and 5.

The PRUNER-II algorithm is very similar to the PRUNER-I algorithm in
concept. However, the PRUNER-II algorithm attempts to eliminate the
potentially huge memory requirements of the PRUNER-I algorithm. While processing
each node Li, the PRUNER-I algorithm maintains a list η(i) that contains all
the patterns that are consistent with each one of the Group1-edges. When the
number of such edges is huge, the amount of memory required for η(i) may be
too big. This might especially be the case when d is large and the d/l ratio is
large, in which case the graph G(L, E) will be highly connected.

At each vertex Li, the PRUNER-II algorithm processes edges one by one.
For each edge (Li, Lj, D(Li, Lj)), it enumerates the set of consistent patterns
p(Li, Lj). For each consistent pattern P ∈ p(Li, Lj), if we compare P with all
the l-grams that are directly connected with vertex Li, we can determine if P
mismatches in at most d positions with at least k - 1 of them. However, a deeper
analysis reveals that it is not necessary to compare P with all the l-grams that
share an edge with Li. For any l-gram Lq, if D(Lq, P) ≤ d, then D(Lq, Lj) will
be less than or equal to 2d. This means that the l-grams Lq and Lj will also
be connected. Therefore, we only need to compare P with all vertices Lq such
that the edge (Lq, Lj, D(Lq, Lj)) ∈ E. If at least k - 2 of them mismatch with
P in at most d positions, the algorithm reports P. Otherwise, P is discarded. As in the
PRUNER-I algorithm, it removes the edge (Li, Lj, D(Li, Lj)) after checking all
the patterns in p(Li, Lj).
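The edge-by-edge processing described for PRUNER-II can be sketched as follows for the single-sequence setting of Section 3.2. Several points are assumptions of this sketch and not taken from the paper: the consistent patterns of an edge are found by filtering all $4^l$ candidates (a brute-force stand-in for the enumeration subroutine, usable only for very small l), every edge is processed regardless of its group, the initial pruning step is omitted, and k ≥ 2 is assumed. The function name is hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pruner2_sketch(s, l, d, k, alphabet="ACGT"):
    """Report patterns within d mismatches of at least k l-grams of the string s."""
    grams = [s[i:i + l] for i in range(len(s) - l + 1)]
    n = len(grams)
    # adjacency sets: an edge joins l-grams that differ in at most 2d positions
    adj = {i: {j for j in range(n)
               if j != i and hamming(grams[i], grams[j]) <= 2 * d}
           for i in range(n)}
    candidates = ["".join(p) for p in product(alphabet, repeat=l)]
    reported = set()
    for i in range(n):
        for j in sorted(adj[i]):  # sorted() copies, so edges can be removed in the loop
            for pat in candidates:
                # keep only patterns consistent with both end points of the edge
                if hamming(pat, grams[i]) <= d and hamming(pat, grams[j]) <= d:
                    # compare only with the other neighbors of Lj
                    support = sum(1 for q in adj[j]
                                  if q != i and hamming(pat, grams[q]) <= d)
                    if support >= k - 2:
                        reported.add(pat)
            adj[i].discard(j)  # the edge is removed once its patterns are checked
            adj[j].discard(i)
    return reported


if __name__ == "__main__":
    print(sorted(pruner2_sketch("ACGTACGAACG", l=3, d=1, k=3)))
```

No pattern list is kept per vertex, only the set of reported patterns, which reflects the memory saving that motivates PRUNER-II.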

SearchForPatterns()

Inputs: S, l, d, k, n

1. Buildgraph(S, l, d, k, n)    /* the routine that builds the graph G(L,E) */
2. PruneGraph(G(L,E), l, d, k, n)    /* the pruning routine which removes
       all the vertices with degree < k - 1 */
3. for i ← 0 to n - l + 1 do
4.     ProcessLGram(G(L,E), i, l, d, k, n)
5.     if Degree(Li) < k - 1
6.         RemoveLGram(G(L,E), i)    /* remove the vertex Li
           (and the edges incident on Li) from the graph */
7. end for
8. PruneGraph(G(L,E), l, d, k, n)    /* remove l-grams with degree < k - 1 */
9. for i ← 0 to n - l + 1 do    /* check if any l-grams are still remaining */
10.     if Degree(Li) ≥ k - 1
11.         Report(Li)    /* report all remaining l-grams */
12.     end if
13. end for



Fig. 5. The PRUNER-I algorithm 



3.5 Extending the Algorithm to Handle Multiple Sequences 

When the input sample is made of t sequences of length n each, and the problem
is to find an (l, d) motif that occurs in at least k of them, the graph $G(L, E)$
will be a t-partite graph. At each vertex in the graph, we need to maintain and
update another variable, which we call t-degree. The variable t-degree stores the
number of distinct sequences, among the t input sequences, that the current
vertex is connected to. In the algorithms discussed above, whenever we refer to
the degree of a vertex, we use t-degree instead of the actual degree of the
vertex. Whenever we check a pattern P, it is no longer sufficient to check
whether the pattern is within d mismatches of k − 1 other l-grams. We need to
make sure that these l-grams are derived from k − 1 distinct sequences in the
sample. The implementation typically involves maintaining a bit-vector of length
t for the pattern that is being considered. Whenever the pattern is within d
mismatches of an l-gram, the bit corresponding to the sequence from which the
l-gram is derived is set to 1. P satisfies the search criteria if at least k − 1
(or whatever is necessary at that point in the algorithm) bits are set to 1.
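
A minimal sketch of the bit-vector check follows, assuming that each l-gram is supplied together with the index of the sequence it comes from. The function name `satisfies_quorum` and the use of a Python integer as the bit-vector are illustrative choices, not part of the original implementation.

```python
def satisfies_quorum(pattern, lgrams, d, needed):
    """lgrams: iterable of (sequence_index, lgram) pairs.

    Sets one bit per sequence that contains an l-gram within d mismatches
    of `pattern`, then checks whether at least `needed` bits are set."""
    seen = 0  # Python int used as a bit-vector with one bit per sequence
    for seq_idx, lgram in lgrams:
        if sum(a != b for a, b in zip(pattern, lgram)) <= d:
            seen |= 1 << seq_idx
    return bin(seen).count("1") >= needed
```

Two matching l-grams from the same sequence set the same bit, so they are counted once, which is exactly the difference between degree and t-degree.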

3.6 Complexity Analysis 

Building the graph involves calculating the mismatch count for each l-gram
pair $(L_i, L_j)$ such that $L_i$ and $L_j$ are derived from different input
sequences. There are (n − l + 1) l-grams in each input sequence, and up to
n(t − 1) other l-grams for each l-gram in an input sequence. Therefore, building
the graph takes
Table 1. Performance of the algorithms. Only one numeric column per algorithm
is legible in the source; the d/l ratio column and the second per-algorithm
column could not be recovered and are omitted.

Test case (l, d)    PRUNER-I         PRUNER-II
(13, 3)             43               43
(13, 4)             166              278
(14, 4)             198              178
(15, 4)             122              91
(16, 4)             51               43
(16, 5)             540              247
(17, 5)             315              161
(18, 5)             174              92
(19, 5)             101              48
(20, 5)             65               21
(21, 5)             10               11
(22, 5)             56               7
(22, 6)             649              83
(23, 6)             525              64
(24, 6)             720              11
(25, 6)             592              60
(26, 6)             613              62
(27, 7)             out of memory    614
(28, 7)             out of memory    640
(29, 7)             [illegible]      640

$O(n^2 t^2)$ time. Pruning the graph involves removing all the edges incident on
each vertex whose degree is less than k − 1. In the worst case, we might have to
delete all the nodes, so the maximum number of edges that need to be removed is
((k − 1)nt − 1), which is $O(ntk)$. This time is common to both PRUNER-I and
PRUNER-II. In the PRUNER-I algorithm, each l-gram can have up to n(t − 1)
2d-mismatch neighbors. Therefore, at each l-gram, we might have to enumerate
the patterns consistent with n(t − 1) other l-grams. The maximum number of
these consistent patterns, as discussed in Section 3.3, is $O(l^d 4^d)$. Hence
the worst-case time complexity at each node is $O(n t\, l^d 4^d)$. We need to
store all these patterns in a list, so the space needed is of the same order.
In the worst case, we will have to process (t − k + 1)n l-grams, since no new
patterns can be discovered after removing all the vertices corresponding to
(t − k + 1) sequences in the sample. Therefore, the overall complexity is
$O(n^2 t (t - k + 1)\, l^d 4^d)$. If k is small w.r.t. t, this will be
$O(n^2 t^2\, l^d 4^d)$. When k = t, the complexity of the PRUNER-I algorithm is
$O(n^2 t\, l^d 4^d)$. In the case of the PRUNER-II algorithm, each edge is
processed separately. All the patterns consistent with each edge $(L_i, L_j)$
have to be compared with all the l-grams that are connected to both $L_i$ and
$L_j$. In the worst case, there can be n(t − 2) vertices that are connected to
both $L_i$ and $L_j$. The total number of edges could be $n^2 t(t - 1)$ in the
worst case. Each edge can have $O(l^d 4^d)$ patterns that are consistent with
it, so the total time taken will be $O(n^3 t^3\, l^d 4^d)$. Each pattern can be
compared separately; therefore the space needed is approximately the same as
that necessary for the graph.
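
The two steps whose cost is shared by both algorithms, building the graph and pruning it, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: vertices are assumed to be (sequence index, start position) pairs, the names `build_graph`, `t_degree` and `prune` are hypothetical, and the quadratic pairwise loop mirrors the $O(n^2 t^2)$ bound rather than any optimised construction.

```python
from collections import defaultdict


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_graph(sequences, l, d):
    """Connect l-grams from different sequences that differ in <= 2d positions."""
    lgrams = [(s, p, seq[p:p + l])
              for s, seq in enumerate(sequences)
              for p in range(len(seq) - l + 1)]
    adj = defaultdict(set)
    for a in range(len(lgrams)):
        sa, pa, ga = lgrams[a]
        for b in range(a + 1, len(lgrams)):
            sb, pb, gb = lgrams[b]
            if sa != sb and hamming(ga, gb) <= 2 * d:
                adj[(sa, pa)].add((sb, pb))
                adj[(sb, pb)].add((sa, pa))
    return adj


def t_degree(adj, v):
    """Number of distinct sequences that vertex v is connected to."""
    return len({s for s, _ in adj[v]})


def prune(adj, k):
    """Repeatedly delete vertices whose t-degree is below k - 1."""
    queue = [v for v in adj if t_degree(adj, v) < k - 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue  # already removed
        for u in adj.pop(v):
            adj[u].discard(v)
            if t_degree(adj, u) < k - 1:
                queue.append(u)
    return adj
```

Isolated l-grams never enter `adj`, so they are treated as already pruned.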



4 Results 

The algorithms were tested on generated samples containing 20 sequences of 600 
nucleotides each. The sequences are implanted with randomly mutated patterns 
at randomly chosen positions. Each occurrence of the pattern is allowed to have 
up to d mismatches. The tests were carried out on a Pentium-4 3.2 GHz PC 
with 2GB of memory, running Redhat Linux 9.0. The time/memory results are 
presented in Table 1. The PRUNER-I algorithm ran out of memory for the (27,7) 
and the (28,7) cases. The implanted pattern was detected in all the remaining
test cases. 
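
A test sample of the kind described above can be generated along the following lines. The sketch assumes a uniform random background over {A, C, G, T} and one implanted occurrence per sequence; the function name `implant` and the fixed seed are illustrative, and the original generator may differ in these details.

```python
import random


def implant(t, n, l, d, seed=0):
    """Return (motif, sequences, positions): t random sequences of length n,
    each containing one copy of a random length-l motif with up to d
    substitutions, implanted at a random position."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        seq = [rng.choice(alphabet) for _ in range(n)]
        occ = list(motif)
        # mutate between 0 and d distinct positions of the occurrence
        for p in rng.sample(range(l), rng.randint(0, d)):
            occ[p] = rng.choice([c for c in alphabet if c != occ[p]])
        start = rng.randrange(n - l + 1)
        seq[start:start + l] = occ
        sequences.append("".join(seq))
        positions.append(start)
    return motif, sequences, positions
```

For the challenge instance discussed in the paper one would call `implant(20, 600, 15, 4)`.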

5 Conclusion 

We have presented two new algorithms for finding monad patterns. Both
algorithms perform extremely well on the challenge problem of (15,4) on 20
input sequences of 600 nucleotides. As d increases in comparison to l, i.e., when
the d/l ratio increases, the PRUNER-I algorithm takes longer and uses more
memory. The PRUNER-I algorithm runs out of memory for large values of l and
d. The PRUNER-II algorithm, on the other hand, can handle large values of l
and d, but reacts very sharply to the d/l ratio. As long as the d/l ratio is around
0.25, the PRUNER-II algorithm performs very well, independent of the actual
values of l and d. Unlike WINNOWER and cWINNOWER, the algorithms we
presented here are not sensitive to k. Our algorithms are able to detect patterns
even for very small values of k. The only concern when dealing with very small
values of k is that there could be random signals in the input sample that meet
the search criteria. An interesting observation from the test cases is that the
graph itself consumes more and more space as the d/l ratio gets bigger. This
is because the graph contains more and more edges, as a larger number of
l-gram pairs mismatch in at most 2d positions. In the future, we plan to
investigate compact representations for the graph. Another approach may
involve a two-pass algorithm. WINNOWER or cWINNOWER can be used in
the first pass to remove some spurious edges, and our algorithms can be applied
in the second pass. As the graph then has far fewer edges, PRUNER-I or
PRUNER-II may perform very well. For the first pass, we can use a
low-sensitivity version of WINNOWER or cWINNOWER in order to maximize
the speed.
Abstract. We present in this paper three algorithms. The first extracts
repeated motifs from a weighted sequence. The motifs correspond to
words which occur at least q times and with Hamming distance e in
a weighted sequence with probability > 1/k each time, where k is a
small constant. The second algorithm extracts common motifs from a
set of N ≥ 2 weighted sequences with Hamming distance e. In the second
case, the motifs must occur twice with probability > 1/k, in 1 ≤ q ≤ N
distinct sequences of the set. The third algorithm extracts maximal pairs
from a weighted sequence. A pair in a sequence is the occurrence of the
same substring twice. In addition, the algorithms presented in this paper
improve slightly on previous work on these problems.



1 Introduction 

DNA and protein sequences can be seen as long texts over specific alphabets en- 
coding the genetic information of living beings. Searching specific sub-sequences 
over these texts is a fundamental operation for problems such as assembling the 
DNA chain from pieces obtained by experiments, looking for given DNA chains 
or determining how different two genetic sequences are. However, exact searching 
is of little use since the patterns rarely match the text exactly. The experimental 
measurements have various errors and even correct chains may have small differ- 
ences, some of which are significant due to mutations and evolutionary changes. 

Finding approximate repetitions or signals arises in several applications in
molecular biology. Moreover, establishing how different two sequences are is
important for reconstructing the tree of evolution (phylogenetic trees). All
these problems require a concept of similarity, or in other words a distance
metric between two sequences. Additionally, many problems in Computational
Biology involve searching for unknown repeated patterns, often called motifs,
and identifying regularities in nucleic or protein sequences. Both imply
inferring patterns, of unknown content at first, from one or more sequences.
Regularities in a sequence may come under many guises. They may correspond to
approximate repetitions randomly dispersed along the sequence, or to
repetitions that occur in a periodic or approximately periodic fashion. The
length and number of repeated elements one wishes to be able to identify may
be highly variable.

In the study of gene expression and regulation, it is important to be able
to infer repeated motifs or structured patterns and answer various biological
questions, such as which elements in sequence and structure are involved in the
regulation and expression of genes through their recognition. The analysis of the
distribution of repeated patterns permits biologists to determine whether there
exists an underlying structure and correlation at a local or global genomic level.
Structured patterns correspond to an ordered collection of p boxes (always of
initially unknown content) and p − 1 intervals of distances (one between each
pair of successive boxes in the collection). Structured patterns make it possible
to identify conserved elements recognized by different parts of the same protein
or macromolecular complex, or by various complexes that then interact with one
another.

In this work, we examine various instances of the Motif Identification Problem
in weighted sequences. In particular, we are given a set of weighted sequences
$S = \{S_1, S_2, \ldots, S_k\}$, $S_i \in \Sigma^*$, and we are asked to extract
interesting motifs such that each motif occurs in at least q sequences.

Generally speaking, a weighted sequence could be defined as a sequence
of (symbol, weight) pairs, $S = ((s_1, w_1), (s_2, w_2), \ldots, (s_n, w_n))$,
where $w_i$ is the weight of symbol $s_i$ in position i (the occurrence
probability of $s_i$ at position i).

Biological weighted sequences can model important biological processes, such 
as the DNA-Protein Binding Process or Assembled DNA Chains. Thus, motif 
extraction from biological weighted sequences is a very important procedure in 
the translation of gene expression and regulation. In more detail, the extracted 
motifs from weighted sequences correspond in general to binding sites. These 
are sites in a biological molecule that will come into contact with a site in an- 
other molecule permitting the initiation of some biological process (for instance, 
transcription or translation). In addition, these weighted sequences may cor- 
respond to complete chromosome sequences that have been obtained using a 
whole-genome shotgun strategy [10]. By keeping all the information the
whole-genome shotgun produces, we would like to dig out information that has
previously gone undetected after being faded out during the consensus step.
Finally, protein families can also be represented by weighted sequences ([4],
in 14.3.1).

A weighted biological sequence is often represented as a d × n matrix, termed
a weight matrix, where d is the size of the respective alphabet (in the case of
DNA weighted sequences d = 4) and n is the length of the sequence. Each cell
$p_{ij}$ of the weight matrix stores the probability of appearance of symbol i in
the j-th position of the input sequence. An instance of a weighted (sub)sequence
p is a (sub)sequence of p where a symbol has been chosen for each position.
The probability of occurrence of this instance is the product of the probabilities
of the symbols of all positions of the instance. For example, for the instance
$A_1 A_2 \cdots A_n$ of the weighted sequence shown in Figure 1, the probability
of occurrence is $\prod_{i=1}^{n} p_{Ai}$.

\[
\begin{pmatrix}
p_{A1} & p_{A2} & p_{A3} & \cdots & p_{An} \\
p_{C1} & p_{C2} & p_{C3} & \cdots & p_{Cn} \\
p_{G1} & p_{G2} & p_{G3} & \cdots & p_{Gn} \\
p_{T1} & p_{T2} & p_{T3} & \cdots & p_{Tn}
\end{pmatrix}
\]

Fig. 1. The Weight Matrix representation for a weighted DNA sequence
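
As a small worked illustration of the product rule for instances, the following Python sketch stores a weight matrix as one probability row per symbol. The layout, the function name `instance_probability` and the example values are assumptions made for illustration only.

```python
from math import prod


def instance_probability(matrix, instance):
    """matrix maps each symbol to its list of per-position probabilities;
    the instance probability is the product of the chosen entries."""
    return prod(matrix[sym][pos] for pos, sym in enumerate(instance))


# Hypothetical weighted DNA sequence of length 3 (each column sums to 1).
example = {
    "A": [1.0, 0.5, 0.25],
    "C": [0.0, 0.5, 0.25],
    "G": [0.0, 0.0, 0.25],
    "T": [0.0, 0.0, 0.25],
}
```

With these values, the instance ACG has probability 1.0 · 0.5 · 0.25 = 0.125, while any instance starting with C has probability 0.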

A great number of algorithms have been proposed in the relevant literature
for inferring motifs in biological sequences (e.g. regulatory sequences, protein
coding genes). The majority of these algorithms rely on either statistical or
machine learning approaches for solving the inference problem. In [11] the
authors defined a notion of maximality and redundancy for motifs, based on the
idea that some motifs could be enough to build all the others. These motifs are
termed tiling motifs. The goal is to define a basis of motifs, in other words a
set of irredundant motifs that can generate all maximal motifs by simple
mechanical rules.

Other approaches build all possible motifs by increasing length. These
solutions have a high time and space complexity and cannot be applied in the
case of weighted sequences, due to their combinatorial complexity. Finally, in
[12,8] the authors use the suffix tree to spell all valid models (exact or
approximate).

In addition, finding maximal pairs in ordinary sequences was first described
by Gusfield in [4]. This algorithm uses a suffix tree to report all maximal pairs
in a string of length n in time $O(n + \alpha)$ and space $O(n)$, where $\alpha$
is the number of reported pairs. In [1] the authors presented methods for finding
all maximal pairs under various constraints on the gap between the two
substrings of the pair. In a string of length n, they find all maximal pairs with
gap in an upper and lower bounded interval in time $O(n \log n + \alpha)$. If the
upper bound is removed, the time is reduced to $O(n + \alpha)$.

The structure of the paper is as follows. In Section 2 we give some basic
definitions on weighted sequences to be used in the rest of the paper. In Section 3
we address the problem of extracting simple models, while in Section 4 we address
the problem of Motif Extraction in weighted sequences. Finally, in Section 5 we
conclude and discuss open problems in the area.

2 Preliminaries 

In this section we provide formal definitions of the problems we tackle, we
give some basic definitions and finally we briefly describe the best known
algorithms for these problems in the case of solid sequences (sequences that are
not weighted). The first problem we wish to solve is the repeated motifs problem.

Problem 1 Given a weighted sequence s and three integers 0 < k < c, e ≥ 0
and q ≥ 2, for some small constant c, find all models m with probability of
occurrence > 1/k such that m is present at least q times in s and the Hamming
distance between all occurrences is ≤ e. All these occurrences must not overlap.

The non-overlapping restriction is added because when two models a and b
of s overlap, then it may be the case that a cancels b. More specifically, assume
that a and b overlap at position i. Then, it may be the case that a uses symbol
$\sigma_1 \in \Sigma$ with probability $\pi_i(\sigma_1)$ while b uses symbol
$\sigma_2 \in \Sigma$ with probability $\pi_i(\sigma_2)$, which is not correct.
To overcome this difficulty we do not allow the occurrences of models to
overlap. The second problem we wish to solve is the common motifs problem.

Problem 2 Given a set of N weighted sequences $S_i$ (1 ≤ i ≤ N) and three
integers 0 < k < c, e ≥ 0 and 2 ≤ q ≤ N, for some small constant c, find all
models m with probability of occurrence > 1/k such that m is present in at least
q distinct sequences of the set and the Hamming distance between them is ≤ e.

When a model satisfies the restrictions posed by each of the above problems,
it is called valid. For the above two problems the spelling of models is done using
the Weighted Suffix Tree (WST). The WST of a weighted sequence s, WST(s),
is the compressed trie of all valid weighted subwords starting within each suffix
$s_i$ of s$, where $ ∉ Σ. A weighted subword is valid if its occurrence
probability is > 1/k. The WST is built in linear time and space when k is a
small constant. The WST was first presented in [5] as an elegant data structure
for reporting the repetitions within a weighted biological sequence. In [6] the
authors presented an efficient algorithm for constructing the WST.
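
To make the notion of a valid weighted subword concrete, the sketch below enumerates, by brute force, every subword starting at a given position whose occurrence probability exceeds 1/k. It is not the WST construction of [6]; the function name `valid_factors` and the list-of-dictionaries representation of the weighted sequence are assumptions for illustration, and the strict inequality follows the text as printed.

```python
def valid_factors(ws, start, k):
    """ws: list of dicts mapping symbol -> probability, one per position.
    Return (word, probability) for every subword that starts at `start`
    and has occurrence probability > 1/k."""
    out = []

    def extend(pos, word, prob):
        if pos == len(ws):
            return
        for sym, p in ws[pos].items():
            q = prob * p
            if q > 1.0 / k:  # prune: longer extensions can only be less probable
                out.append((word + sym, q))
                extend(pos + 1, word + sym, q)

    extend(start, "", 1.0)
    return out
```

Because probabilities only decrease as a subword grows, at most k subwords of any fixed length can be valid at each starting position, which is why a small constant k keeps the structure linear.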

Finally, the third problem we wish to solve is the following.

Problem 3 Given a set of N weighted sequences $S = s_1, s_2, \ldots, s_N$, an
integer 0 < k < c and a quorum q ≤ N, for some small constant c, find all maximal
pairs m such that m is valid, that is, it appears with probability greater than
1/k in at least q sequences of the set S.

A pair in a string is the occurrence of the same substring twice. A pair is
maximal if the occurrences of the substring cannot be extended to the left or
to the right without making them different. The gap of a pair is the number of
characters between the two occurrences of the substring. A pair is valid if each
substring appears with probability > 1/k.
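
The definition of a maximal pair can be illustrated on an ordinary (non-weighted) string with a brute-force sketch. The suffix-tree algorithms cited in the paper are far more efficient; the quadratic enumeration below, with the hypothetical name `maximal_pairs`, is only meant to make the left- and right-maximality conditions explicit.

```python
def maximal_pairs(s):
    """Return (i, j, length) with i < j such that s[i:i+length] == s[j:j+length]
    and the two occurrences cannot be extended to the left or to the right."""
    n = len(s)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            if i > 0 and s[i - 1] == s[j - 1]:
                continue  # extendable to the left, so not maximal
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1  # extend to the right as far as possible
            if length > 0:
                out.append((i, j, length))
    return out
```

The gap of a reported pair is `j - (i + length)`; a negative value indicates overlapping occurrences.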

2.1 Basic Definitions 

Let $\Sigma$ be an alphabet of cardinality $\sigma = |\Sigma|$. A sequence s of
length n is represented by $s[1..n] = s[1]s[2] \cdots s[n]$, where
$s[i] \in \Sigma$ for 1 ≤ i ≤ n, and n = |s| is the length of s. An empty
sequence is denoted by $\varepsilon$; we write
$\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$. A weighted sequence is defined as
follows.

Definition 1. A weighted sequence $s = s_1 s_2 \cdots s_n$ is a sequence in
which each position $s_i$ is a set of couples $(s, \pi_i(s))$, where $\pi_i(s)$
is the occurrence probability of character s at position i. For every position
1 ≤ i ≤ n, $\sum_{s} \pi_i(s) = 1$.

Valid motifs in weighted sequences correspond to words that occur at least
q times in the weighted sequence with probability of appearance > 1/k. If we
consider approximate motifs, then the distance of a valid motif should be ≤ e.

Definition 2. A set $S_l$ of positions inside a weighted sequence s represents a
set of weighted factors of length l that are similar if and only if there exists
(at least) one motif $m \in \Sigma^l$ such that for all elements i in $S_l$,
$dist(m, s_i) \le e$.

In other words, the set $S_l$ contains all motifs of length l with at most e
mismatches. The size of $S_l$ is represented by V(e, l). We report the valid
motifs using the WST. V(e, l) is an upper bound for the number of motifs that
correspond to the maximum size of the output.

Definition 3. The WST of a weighted sequence S, denoted as WST(S), is the
compressed trie of all possible subwords made up from the weighted subwords
starting within each suffix $S_i$ of S$, $ ∉ Σ, and having an occurrence
probability > 1/k, where k is a small constant. Let L(v) denote the path-label of
node v in WST(S), which is the catenation of edge labels in the path from the
root to v. Leaf v of WST(S) is labeled with index i if there exists j > 0 such
that $L(v) = S_{i,j}[i..n]$ and $\pi(S_{i,j}[i..n]) > 1/k$, where j > 0 denotes
the j-th weighted subword starting at position i. The leaf-list LL(v) is the list
of the leaf-labels in the subtree of v.



2.2 Previous Work 

In the following, we sketch the algorithms proposed by Sagot [12] and Iliopoulos
et al. [7], on which our solutions are based. The common characteristic of both
papers is that the proposed algorithms make heavy use of suffix trees. In a
nutshell, the suffix tree is an indexing structure for all suffixes of a string s,
and it is well known that it can be constructed in linear time and linear space
[9]. The generalized suffix tree is a suffix tree for more than one string. Since
the suffix tree is a well-known indexing structure for strings, we will assume
that the reader is familiar with its basic properties and characteristics. In the
discussion to follow, for reasons of clarity we discuss the algorithm on the
uncompressed suffix tree (a sequence of nodes with just one child is not
collapsed into a single edge).

The repeated and common motifs problems are handled in [12]. For the first
problem the input is a string s of length n over an alphabet $\Sigma$ and two
integers q ≥ 2 (the quorum) and e ≥ 0 (the maximum number of mismatches). In
addition, the algorithm is given the length l of the wanted model. Consequently,
if we want to find all possible models we have to apply the algorithm for each
possible length (the same applies to the second problem). Finally, for
both problems, the output of the algorithms is only the models and not the
exact positions of their appearance.

Assuming that e = 0, the algorithm for the repeated motifs problem locates
each node $v_i$ that corresponds to a model $m_i$ of length l and then checks
whether this model is valid, that is, whether it satisfies the quorum constraint.
This is easy to do, by checking whether the number of leaves below each node
$v_i$ is at least q. If we allow for errors, then a model $m_i$ corresponds to
many nodes $v_{i_1}, v_{i_2}, \ldots, v_{i_r}$ on the suffix tree. This model is
valid if the sum of the leaves of all these nodes is at least q. With a simple
linear-time preprocessing it is very easy to compute the number of leaves for
each node of the suffix tree. Note that occurrences of models may overlap. The
space used by the algorithm is linear while the time complexity for a specific
length l is $O(nV(e, l))$.

For the common motifs problem, the input is a set of strings
$S = s_1, s_2, \ldots, s_N$ and two integers q ≥ 2 and e ≥ 0. First, a
generalized suffix tree is constructed for S in time $O(nN)$. Then, the mechanism
to check the quorum constraint is implemented. For each node v in the suffix
tree, a bit vector $b_v$ of N positions is constructed such that $b_v[i] = 0$
when in the subtree of v there is no leaf with label i, that is, there is no
occurrence of a suffix of string $s_i$ in the subtree of v (otherwise
$b_v[i] = 1$). Then, the procedure is exactly the same as in the repeated
motifs algorithm with the exception that we use the bit vectors to check whether
the quorum constraint is satisfied. The space requirement of this algorithm is
$O(nN^2/w)$, where w is the word length of the machine. The time complexity is
$O(nN^2 V(e, l))$, for a specified length l.

Finally, we come to the solution described in Iliopoulos et al. [7]. In this
work, all maximal pairs which occur in each string of a set of strings, without
any restrictions on the gaps, are reported in $O(n + \alpha)$ time and linear
space, where $\alpha$ is the size of the output. In addition, it reports all
maximal pairs which occur in each string of a set of strings with the same gap
that is upper bounded by a constant. This is achieved in
$O(n \log^2 n + \alpha N \log n)$ time, where N is the number of strings and n is
the total length of the strings, using linear space.

We supply in this paper methods that address the above problems on
weighted sequences. For simple motifs we propose an algorithm that works in
$O(nNqV(e, l))$ time and $O(nNq)$ space, and for maximal pairs an
$O(Nn \log(Nn) + \alpha)$ algorithm using linear space.



3 Extracting Simple Models 

In this section we supply an algorithm for reporting all maximal pairs in a set
of weighted sequences. More specifically, given a set of N weighted sequences
$S = s_1, s_2, \ldots, s_N$, a small integer k > 0 and a quorum q ≤ N, we report
all maximal pairs whose components appear with probability greater than 1/k
in at least q sequences of the set S. We have considered two variations of this
problem depending on the restrictions on the gaps. In the first version we assume
that there is no restriction on the gaps of the pairs, thus one pair may appear in
different sequences with different gaps. In the second version of the problem a
pair has to occur with approximately the same gap, which is upper bounded
by a constant value b. For solving these problems we suggest two methods that
are extensions of the algorithms provided in [7] for these problems on
plain sequences. Our solutions address these problems on weighted sequences
in a simpler and more efficient way.

Initially, a generalized weighted suffix tree gWST(S) is constructed. A
generalized weighted suffix tree is similar to the generalized suffix tree and is
built upon all the weighted sequences of S. For the construction of gWST(S) the
algorithm of [6] is used for each of the weighted sequences in S and all the
produced factors are superimposed in the same compacted trie. The total time for
this operation is linear in the sum of the lengths of the weighted sequences,
$O(\sum_{i=1}^{N} |s_i|)$. The construction method is invoked for each of the
weighted sequences starting from the root of the same compacted trie. The suffix
links are preserved, so it is like building a generalized suffix tree from a set
of regular sequences using the same auxiliary suffix tree. Thus, the space of the
gWST(S) is linear in the total length of the weighted sequences. The gWST(S) is a
compacted trie in which internal nodes have out-degree at least 2 and at most
$\sigma = |\Sigma|$. The first step, as also mentioned in [4], [1] and [7], is to
binarize gWST(S). Each node u with out-degree |u| > 2 is replaced by a binary
tree with |u| leaves and |u| − 1 internal nodes. Each new edge is labeled with
the empty string $\varepsilon$ so that all new nodes have the same path-label as
node u, which they replace. Assuming that the alphabet size $\sigma$ is constant,
the whole procedure needs linear time and the final data structure has linear
space.
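
The binarization step can be sketched on a generic tree as follows. The `Node` class and the function name `binarize` are hypothetical and stand in for whatever node representation the gWST implementation uses; the sketch only shows how empty-labelled edges keep the path-labels unchanged.

```python
class Node:
    def __init__(self, label="", children=None):
        self.label = label              # label of the edge entering this node
        self.children = children or []


def binarize(node):
    """Replace every node with more than two children by a chain of binary
    nodes joined by empty-labelled edges, so path-labels are preserved."""
    node.children = [binarize(c) for c in node.children]
    while len(node.children) > 2:
        right = node.children.pop()
        left = node.children.pop()
        # The new internal node hangs below `node` on an empty edge.
        node.children.append(Node("", [left, right]))
    return node
```

A node with four children ends up as three internal nodes (the original node plus two new ones) above the same four subtrees, matching the |u| − 1 count given above.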

The indexes of the factors at the leaves of the gWST(S) are organized in
special leaf-lists according to the weighted sequence $S_i$ to which they belong
and the character to their left (left-character). The left-character of an index
i is the character that exists at position i − 1. In weighted sequences there may
be more than one choice of left-character for an index i. For those cases we
introduce a new class that keeps all the indexes with more than one
left-character. This new class guarantees left maximality, since for any
left-character of an index x in that class there is at least one index y with a
different one. Thus, a leaf-list is a set of N vectors, one for each of the
weighted sequences, where each vector contains $\sigma + 1$ lists, one for each
of the $\sigma + 1$ choices of left-character (Fig. 2).




Fig. 2. The leaf-lists where the indexes are organized 



When the construction of the gWST(S) is completed, a bottom-up process
is initiated. Let $L_l$ and $L_r$ be the leaf-lists of the left and right
descendants of a node u. The candidate maximal pairs, defined by the path-label
of node u, for each of the sequences $S_i$ can be found by combining, for every
j, the indexes of list $L_l.S_i.lc_j$ (the list for symbol $c_j$ in weighted
sequence $S_i$) with the lists $L_r.S_i.lc_m$, for every $m \neq j$. If we do
not allow overlaps between the components of a pair, we do not have to combine
all the indexes of lists $L_l.S_i.lc_j$ and $L_r.S_i.lc_m$; instead, for every
$x \in L_l.S_i.lc_j$ we want to find all $y \in L_r.S_i.lc_m$ for which
$x - (y + |pathlabel(u)|) \ge 0$ or $y - (x + |pathlabel(u)|) \ge 0$ holds. In
order to achieve that efficiently, the lists are organized as AVL trees and one
list is virtually merged with the other. More specifically, we find the positions
where the items of one list, increased and decreased by $|pathlabel(u)|$, would
be placed if we really merged the two lists, and then proceed rightwards and
leftwards respectively. If we choose the smaller list and virtually merge it with
the other, the following three lemmas guarantee that in total
$O(Nn \log(Nn) + \alpha)$ time is needed.

Lemma 1. The sum over all nodes u of an arbitrary tree of size n of terms
that are $O(n_1)$, where $n_1 \le n_2$ are the weights (number of leaves) of the
subtrees rooted at the two children of u, is $O(n \log n)$.

Proof. See [7].

Lemma 2. Two AVL trees of size $n_1$ and $n_2$, where $n_1 \le n_2$, can be
merged in time $O(\log \binom{n_1 + n_2}{n_1})$.

Proof. See [2].

Lemma 3. Let T be an arbitrary binary tree with n leaves. The sum over all
internal nodes $u \in T$ of terms $\log \binom{n_1 + n_2}{n_1}$, where
$n_1 \le n_2$ are the weights of the subtrees rooted at the two children of u,
is $O(n \log n)$.

Proof. See [1].
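
The virtual merge described before the lemmas amounts to two range searches per index of the smaller list. The sketch below uses a sorted Python list and the standard `bisect` module as a stand-in for an AVL tree; the function name `non_overlapping_partners` is hypothetical, and slicing copies the reported indexes, which is acceptable here because they are part of the output.

```python
from bisect import bisect_left, bisect_right


def non_overlapping_partners(x, sorted_other, length):
    """Return every y in `sorted_other` whose occurrence of the given length
    does not overlap the occurrence starting at x, i.e. y + length <= x
    or x + length <= y."""
    left = sorted_other[:bisect_right(sorted_other, x - length)]
    right = sorted_other[bisect_left(sorted_other, x + length):]
    return left + right
```

Each call costs a logarithmic search plus the number of reported partners, which is the per-index cost that the lemmas sum over the tree.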

Before we retrieve the output for this step, we have to check whether at least q
of the weighted sequences $S_i$ report at least one pair. This can be accomplished
during the virtual merging of the lists. We apply the virtual merge to all
possible combinations of lists, but we spend two more operations for each of the
items of the smaller list to check whether there is at least one candidate pair
for the corresponding sequence. If at least q sequences have at least one maximal
pair, we retrieve the rest of the answer. This additional step adds $n_1$ (the
size of the smaller list) more operations, so according to Lemma 1 the overall
cost is $O(n \log n)$. After the reporting step, the leaf-lists $L_l$ and $L_r$
are merged, merging each list $L_l.S_i.lc_j$ with the list $L_r.S_i.lc_j$ (for
all i, j). According to Lemma 3 this step costs $O(Nn \log(Nn))$ in total. The
result is summarized in the following theorem.

Theorem 1. Given a set of N weighted sequences $S = s_1, s_2, \ldots, s_N$, a
small integer k > 0 and a quorum q ≤ N, we can find in time
$O(Nn \log(Nn) + \alpha)$ all maximal pairs m such that each component of m
appears with probability > 1/k and with no overlaps in at least q sequences of
the set S, where $\alpha$ is the size of the answer.

When the overlap constraint is removed, the query becomes more time
consuming. The output has to be filtered by checking that the overlapping part
of the two components of a pair is the same substring. This is the crucial step,
because at each position of the overlap the two components must make the same
choice of symbol. This can be accomplished by preprocessing the gWST(S) to
answer nearest common ancestor (nca) queries in constant time [13]. When a
candidate pair of indexes x, y has an overlap ($y < x + |pathlabel(u)|$), the
nca(x, y) query upon the gWST(S) yields the longest common extension of the two
sub-factors starting at positions x and y. If the answer to this query is longer
than the overlap, the overlapping portion is the same substring in the two
factors. In this case the time complexity becomes $O((Nn)^2)$.

In the second version of the problem a pair has to occur with approximately
the same gap, which is upper bounded by a constant value b, in at least q
weighted sequences. We can extend the previous method in order to solve
this variation of the problem. At each internal node u, during the reporting
step, we apply a virtual merge and for each index from the smaller list we
retrieve, as described above, at most 2b indexes for candidate pairs. The indexes
that overlap with the former index are validated with nca queries and some are
rejected. To check whether a maximal pair with approximately the same gap occurs
in at least q weighted sequences we apply the following bucketing scheme. We have
b buckets, one for each of the permitted values of the gap. Each candidate pair
is placed in one bucket according to its gap. At the end of the reporting step we
scan all the buckets and report the ones that have size at least q. The buckets
can be implemented as linear lists, and this check can be done in constant time
by storing the size of the lists. Then, the reporting step is invoked, which is
the same as in the case of unrestricted gaps. The running time of this method is
determined by the actual and virtual merging steps, which as before cost
$O(n \log n)$, plus a constant number of operations in every internal node. The
following theorem summarizes the result:

Theorem 2. Given a set of N weighted sequences S = {s_1, s_2, ..., s_N}, a small
integer k > 0 and a quorum q ≤ N, we can find in time O(Nn log(Nn) + α) all
maximal pairs m such that each component of m appears with probability ≥ 1/k
and the gap is bounded by the constant b, in at least q sequences of the set S,
where α is the size of the output.
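
The bucketing scheme used for bounded gaps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `gaps_meeting_quorum`, the input format (pairs of a sequence identifier and a gap), the assumption that permitted gaps are the integers 1..b, and the choice to count distinct sequences per bucket are all assumptions made here for illustration.

```python
def gaps_meeting_quorum(candidates, b, q):
    """Return the gap values (1..b) supported by at least q sequences.

    candidates: iterable of (sequence_id, gap) pairs, one per validated
    candidate maximal pair (hypothetical input format).
    """
    # One bucket per permitted gap value; bucket g - 1 stores gap g.
    buckets = [set() for _ in range(b)]
    for seq_id, gap in candidates:
        if 1 <= gap <= b:              # gaps outside the bound are rejected
            buckets[gap - 1].add(seq_id)
    # Final scan: each bucket is checked in constant time via its size.
    return [g + 1 for g, seqs in enumerate(buckets) if len(seqs) >= q]
```

Storing sequence identifiers in a set is one way to make the bucket size reflect the quorum over sequences; a plain list with a counter, as in the text, would count candidate pairs instead.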

4 Extracting Simple Motifs 

In this section we present algorithms for the repeated and common motifs prob- 
lems on weighted sequences. Our algorithms are based on the algorithms of Sagot 
[12] with the exception that for the repeated motifs problem we add the restric- 
tion that the models must be non-overlapping while for the common motifs 
problem we slightly improve the time and space complexity. 



4.1 The Repeated Motifs Problem 

We are given a weighted sequence s and four integers 0 < k ≤ c, e ≥ 0, l ≥ 2
and q ≥ 2, for some constant c, and we want to find all models m of length l
with probability of occurrence ≥ 1/k such that m is present at least q times in s
and the Hamming distance between all occurrences of m is ≤ e. The occurrences
must not overlap.

First the weighted suffix tree of s is constructed, given that the minimum
probability of occurrence is 1/k. This construction is accomplished in linear time
and space. Then, we spell all models of length l on the tree. We do this in the
same way as Sagot [12], so we are not going to elaborate on this procedure. The
idea is that each time we extend a model m' of length < l by one character,
either with a match or, if the total number of errors in m' is < e, with a
mismatch. This procedure continues until we reach length l or the number of errors
becomes larger than e.

The main problem to tackle is the non-overlapping constraint. We accomplish 
this by filtering the output of the algorithm on the WST. Since we only consider 
Hamming distance and no insertions or deletions are allowed, for the specific 
problem only nodes with path labels of length l will be considered.

Assume that the nodes with path label of length l constitute a set L =
{v_1, v_2, ..., v_|L|}. For each v ∈ L, the leaves of its subtree are put in a sorted list.
These lists are implemented as van Emde Boas trees [14]. Since the numbers
sorted are in the range [1, n], we can sort them in linear time. As a result, the
time complexity for this step is the sum, over all v ∈ L, of the sizes of these
lists. Since all lists are disjoint, this sum is bounded by O(n).

Assume that L' ⊆ L is the set of nodes with path label of length l
that constitute a candidate model m. First we check whether the total number of their
leaves is at least q. If it is not, then the model is not valid since the quorum
constraint is not satisfied. If there are at least q leaves, then we have to check
whether the non-overlapping constraint is also satisfied.

The naive solution would be to merge the lists of all nodes in L' and perform q
successor queries. In this case, the time complexity for q queries would be O(q log log n)
(the log log n factor is due to the van Emde Boas trees), but the merge step requires
O(n) time, which is very inefficient. We instead proceed as follows:

We check among all nodes in L' to find the one with the minimum position
of occurrence. This can be easily implemented in O(|L'|) time, since the lists
for each node are sorted and we check only the first element. Assume that this
element is position x_1 on the string s. Then, among all lists we search for
the successor of the value x_2 = x_1 + |m| + 1, and we keep doing this until the
quorum constraint is satisfied (the final query will be of the form x_q ≥ x_{q-1} +
|m| + 1). This solution has O(q |L| log log n) time complexity, which leads to an
O(n V^2(e,l) q log log n) time solution for the repeated motifs problem, for length
l, using linear space.
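
The repeated successor search described above can be sketched as follows. This is only an illustration under stated assumptions: plain sorted Python lists with binary search (`bisect`) stand in for the van Emde Boas trees, so each query costs O(log n) rather than O(log log n), and the function name `has_q_nonoverlapping` is hypothetical.

```python
from bisect import bisect_left

def has_q_nonoverlapping(lists, m_len, q):
    """Check whether q occurrences can be chosen greedily from the
    sorted position lists of the nodes in L', each at least
    m_len + 1 positions after the previous one (as in the text)."""
    lists = [lst for lst in lists if lst]
    if not lists:
        return q <= 0
    # Minimum position of occurrence: inspect only the list heads.
    current = min(lst[0] for lst in lists)
    found = 1
    while found < q:
        target = current + m_len + 1
        best = None
        for lst in lists:                  # one successor query per list
            pos = bisect_left(lst, target)
            if pos < len(lst) and (best is None or lst[pos] < best):
                best = lst[pos]
        if best is None:                   # no further occurrence exists
            return False
        current = best
        found += 1
    return True
```

Taking the smallest successor over all lists at every step is the greedy choice that leaves the most room for later occurrences.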

This problem can be seen as a static data structure problem, which we call
the multiset dictionary problem. 

Definition 4. Given a superset S = {S_1, S_2, ..., S_x} of sets S_i ⊆ {1, 2, ..., n},
we want to answer q successor queries on a subset S' = {S_{i_1}, S_{i_2}, ...} ⊆ S,
where n is the sum of the sizes |S_i| of all sets in S.

This problem can be seen as a generalization of the iterative search problem
[3]. In this problem, we are given a set of N catalogs and we are asked to answer
N queries, one on each catalog. The straightforward solution is to search in each
catalog, which means that the time complexity will be O(N log n), if each catalog
has size n. If we apply the fractional cascading technique [3], then the time
complexity will become O(N + log n). Unfortunately, we cannot do the same in
the multiset dictionary problem, since we do not know in advance which catalogs
we are going to use, while at the same time the queries are not confined to a
single catalog but to their union. This is an interesting data structure
problem and it would be nice to see solutions with better complexity than the
rather trivial O(q x log log n).

4.2 A Note on the Common Motifs Problem 

We are given a set of N weighted sequences S = {s_i : 1 ≤ i ≤ N} and four integers
0 < k ≤ c, e ≥ 0, l ≥ 2 and q ≥ 2, for some constant c, and we want to find
all models m of length l with probability of occurrence at least 1/k such that
m is present in at least q strings in S and the Hamming distance between all
occurrences of m is ≤ e.

First, the generalized weighted suffix tree of S is constructed, given that the
minimum probability of occurrence is 1/k. This construction is accomplished in
linear time and space for a small constant k. Then, we spell all models of length
l on the tree. We do this in the same way as Sagot [12], so we are not going
to elaborate on this procedure. The idea is that each time we extend a model
m' of length < l by one character, either with a match or, if the total number
of errors in m' is < e, with a mismatch. This procedure continues until we reach
length l or the number of errors becomes larger than e or the quorum constraint
is violated. We sketched in 2.2 a solution with O(nN^2 V(e,l)) time complexity
using O(nN^2) space. We now sketch an algorithm that reduces a factor N to q.

This additional N factor in the space and time complexity comes from the 
check of the quorum constraint. Sagot uses a bit vector of length N to do that. 
However, note that if a node has q different strings then all its ancestors will 
certainly contain q strings. In addition, we do not care about the exact number
of strings in the subtree as long as this number is at least q. In this way we
attach an array of integers of length q to each internal node. If this array gets 
full, then the quorum constraint is satisfied for all its ancestors and we do not 
need to keep track of other strings. 

We fill these arrays by traversing the suffix tree in a post-order manner.
Consider the at most |Σ| children of a node v, and assume that their arrays are
sorted. If one of the children of v has a full array then v will also have a full
array. Otherwise, we merge all these arrays without keeping repetitions.
This can be easily accomplished in O(|Σ| q) time. Since the number of internal
nodes will be O(nN), the pre-processing time is O(nNq) while the space
complexity of the suffix tree will be O(nNq) (less than the O(nN^2) of [12]). Finally,
the time complexity of the algorithm will be O(nNq V(e,l)), which is better than
[12] since q is at most equal to N.
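
As an illustration of the capped arrays, the following Python sketch fills them in post-order on a small explicit tree. The tree representation (a `children` dictionary and a `leaf_string` map from leaves to string identifiers) and the function names are assumptions made for this example; a real generalized weighted suffix tree would supply this information itself.

```python
import heapq

def merge_capped(child_arrays, q):
    """Merge sorted arrays of string ids without repetitions,
    keeping at most q entries."""
    for arr in child_arrays:
        if len(arr) >= q:              # a full child makes the parent full
            return list(arr[:q])
    merged = []
    for sid in heapq.merge(*child_arrays):
        if not merged or merged[-1] != sid:
            merged.append(sid)
            if len(merged) == q:       # quorum reached, stop tracking
                break
    return merged

def quorum_arrays(children, leaf_string, root, q):
    """Post-order traversal that attaches a capped array to every node."""
    arrays = {}
    def visit(v):
        kids = children.get(v, [])
        if not kids:
            arrays[v] = [leaf_string[v]]
        else:
            for c in kids:
                visit(c)
            arrays[v] = merge_capped([arrays[c] for c in kids], q)
    visit(root)
    return arrays
```

A node satisfies the quorum exactly when its array has length q. The recursion is used for brevity only; an explicit stack would be preferable on deep trees.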

5 Discussion and Further Work 

The algorithms we have presented in this paper solve various instances of the 
motif identification problem in weighted sequences, which is very important in 
the area of protein sequence analysis. 
Our future research is to tackle the structured motifs identification problem
in weighted sequences. In this paper we described an algorithm to compute the 
maximal pairs in weighted sequences. We would like to extend this algorithm for 
the extraction of general structured motifs composed of p > 2 parts. 

Abstract. This paper presents a generalization of the notion of longest
repeats with a block of k don’t care symbols introduced by [8] (for k
fixed) to longest motifs composed of three parts: a first and last that
parameterize match (that is, match via some symbol renaming, initially
unknown), and a functionally equivalent central block. Such three-part 
motifs are called longest block motifs. Different types of functional equiv- 
alence, and thus of matching criteria for the central block are considered, 
which include as a subcase the one treated in [8] and extend to the case 
of regular expressions with no Kleene closure or complement operation. 
We show that a single general algorithmic tool that is a non-trivial ex- 
tension of the ideas introduced in [8] can handle all the various kinds of 
longest block motifs defined in this paper. The complexity of the algorithm is,
in all cases, O(n log n).



1 Introduction 

Crochemore et al. [8] have recently introduced and studied the notion of longest 
repeats with a block of k don’t care symbols, where k is fixed. These are words 
of the form V ◊^k W that appear repeated in a string X, where ◊^k is a region of
length k with an arbitrary content. Their work has some relation with previous 
work on repeats with bounded gaps [5,12]. In general, the term motif [9] is 
used in biology to describe similar functional components that several biological 
sequences may have in common. It can also be used to describe any collection of 
similar words of a longer sequence. In nature, many motifs are composite, i.e., 
they are composed of conserved parts separated by random regions of variable 
lengths. By now, the literature on motif discovery is very rich [4], although a 
completely satisfactory algorithmic solution has not been reached yet. 

Even richer (see [15-17]) is the literature on the characterization and detec- 
tion of regularities in strings, where the object of study ranges from identification 
of periodic parts to identification of parts that simply appear more than once. 
Baker [2, 3] has contributed to the notion of parameterized strings and has given 
several algorithms that find maximal repeated words in a string that p-match, 
i.e., that are identical up to a renaming (initially unknown) of the symbols. Pa- 
rameterized strings are a successful tool for the identification of duplicated parts 
of code in large software systems. These are pieces of code that are identical, 
except for a consistent renaming of variables. Motivated by practical as well as 
theoretical considerations, Amir et al. [1] have investigated the notion of func- 
tion matching that incorporates parameterized strings as a special case. Such 
investigations of words that are “similar” according to a well defined correspon- 
dence hint at the existence of meaningful regularities in strings, such as motifs, 
that may not be captured by standard notions of equality. 

In this paper, we make a first step in studying a new notion of motifs, where
equality of strings is replaced by more general “equivalence” rules. We consider
the simplest of such motifs, i.e., motifs of the form V ◊^k W with k fixed, which we
refer to as block motifs. One important point in this study is that the notation ◊^k,
which usually indicates a don’t care block of length k, assumes in the case of the
present paper a new meaning. Indeed, ◊^k is now a place holder stating that, for
two strings described by the motif, the portion of each string going from position
|V| + 1 to |V| + k, referred to as the central block, must match according
to a specified set of rules. To illustrate this notion, consider ab ◊^2 ab and the
rule stating that any two strings described by the motif must have their central
block identical, up to a renaming of symbols. For instance, abxyab and ababab
are both described by ab ◊^2 ab and the given rule, since there is a one-to-one
correspondence between {x,y} and {a,b}. Notions associated with the example
and the intuition just given are formalized in Section 3, where the central block
is specified by a set of matching criteria, all related to parameterized strings
and function matching. Moreover, our approach can be extended to the case
where such a central block is a fixed regular expression, containing no Kleene
closure or complement operation. Our main contribution for this part is a formal
treatment of this extended type of motifs, resulting in conditions under which
their definition is sound.

At the algorithmic level, our main contribution is to provide a general algo-
rithm that extracts all longest block motifs occurring in a string of length n
in O(n log n) time. Indeed, for each of the matching criteria for the central part
presented in Section 3 the general algorithm specializes to find that type of motif 
by simply defining a new lexicographic order relation on strings. We also show 
that the techniques in [8], in conjunction with some additional ideas presented 
here, can be naturally extended to yield a general algorithmic tool to discover 
even subtler repeated patterns in a string. 

Due to space limitations, proofs will either be omitted or simply outlined. 
Moreover, we shall discuss only some of the block motifs that can be identified 
by our algorithm. 

2 Preliminaries 

2.1 Parameterized Strings 

We start by recalling some basic definitions from the work by Brenda Baker
on parameterized strings [2,3]. Let Σ and Π be two alphabets, referred to as
constant and parameter, respectively. A p-string X is a string over the union of
these two alphabets. A p-string is therefore just like any string, except that some
symbols are parameters. In what follows, for illustrative purposes, let Σ = {a, b}
and Π = {u, v, x, y}. Baker gave a definition of matching for p-strings, which
reduces to the following:

Definition 1. Two p-strings X and Y of equal length p-match if and only if
there exists a bijective morphism G : Σ ∪ Π → Σ ∪ Π such that G(a) = a for
a ∈ Σ and Y_i = G(X_i), for all i ∈ [1..|X|].

For instance, X = abuvabuvu and Y = abxyabxyx p-match, with G such
that G(u) = x and G(v) = y.
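
Definition 1 can be checked directly by building the renaming symbol by symbol. The sketch below is a minimal illustration, not Baker's algorithm; the function name `p_match` and the convention that every symbol outside `constants` is a parameter are assumptions made here.

```python
def p_match(x, y, constants):
    """Return True if x and y p-match: constants must coincide and
    parameters must correspond through a one-to-one renaming."""
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        if a in constants or b in constants:
            if a != b:                 # constants are fixed by the renaming
                return False
            continue
        # setdefault records the first pairing and exposes any conflict.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

Keeping both the forward and the backward map is what enforces that the renaming is injective in both directions, so that it can be extended to a bijection.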

For ease of reference, let Σ_1 = Σ ∪ Π. From now on, we refer to p-strings
simply as strings over the alphabet Σ_1 and, unless otherwise stated, we assume
that the notion of match coincides with that of p-match. We refer to the usual
notion of match for strings as exact match. In that case, Σ_1 is treated as a
set of constants. Moreover, we refer to bijective morphisms over Σ_1 as renaming
functions. We also use the terms prefix, suffix and word in the usual way, i.e.,
the i-th suffix of X is x_i x_{i+1} ... x_n, where n is the length of the string. In what
follows, let X^R denote the reverse of X, i.e., x_n ... x_1.

We need to recall the definition of parameterized suffix tree, denoted by p- 
suffix tree, also due to Baker [2,3]. Its definition is based, among other things, 
on a transformation of suffixes and prefixes of a string such that, when they 
match, they can share a path in a lexicographic tree. Indeed, consider the string 
Y = uuuvvv, made only of the parameters u and v. Notice that uuu and vvv p- 
match, and therefore they should share a path when the suffixes of the string are 
“stored” in a (compacted or not) lexicographic tree. That would not be possible 
if the lexicographic tree were over the alphabet Σ_1. We now briefly discuss the
ideas behind this transformation. Consider a new alphabet Σ_2 = Σ ∪ N, where
N is the set of nonnegative integers.
Let prev be a transformation function on strings operating as follows on a
string X. For each parameter, its first occurrence in X is replaced by 0, and
each successive occurrence is represented by its distance, along the string, to the
previous occurrence. Constants are left unchanged. We denote by prev(X) the
prev representation of X over the alphabet Σ_2.

The prev function basically substitutes parameters with integers, leaving the
constants unchanged, i.e., it transforms strings over Σ_1 into strings over Σ_2. For
example, prev(abxyxzaaya) = ab0020aa5a.
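
The prev transformation is easy to state in code. The following Python sketch is an illustration only; it returns a list rather than a string, so that distances larger than 9 remain single symbols, and it treats every symbol outside `constants` as a parameter (both are choices made here, not part of the original definition).

```python
def prev(s, constants):
    """Encode s over the constants and the nonnegative integers:
    the first occurrence of a parameter becomes 0, every later
    occurrence becomes the distance to its previous occurrence."""
    last_seen = {}
    encoded = []
    for i, ch in enumerate(s):
        if ch in constants:
            encoded.append(ch)         # constants are left unchanged
        else:
            encoded.append(i - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = i
    return encoded
```

With constants a and b, `prev("abxyxzaaya", "ab")` yields the symbols a, b, 0, 0, 2, 0, a, a, 5, a, which is the example above.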

The notion of match on strings corresponds to equality in their prev repre- 
sentation [2,3]: 

Lemma 2. Two strings X and Y p-match if and only if prev(X) = prev(Y).
Moreover, these two strings are a match if and only if their reverses are.

Notice that the prev representation of two strings tells us nothing about 
which words, in each string, are a p-match. For instance, consider abxyxzaaya 
and zzzztzwaata. Words xyxzaaya and ztzwaata match, but that cannot be 
directly inferred from the prev representation of the two full strings. 

Let X be a string that ends with a unique endmarker symbol. A parameter- 
ized suffix tree for X (p-suffix tree for short) is a compacted lexicographic tree 
storing the prev representation of all suffixes of X. 

The above definition is sound in the sense that all factors of X are represented 
in the p-suffix tree (that follows from the fact that each such word is prefix of 
some suffix). Even more importantly, matching factors share a path in the tree. 
Indeed, consider two factors that match. Assume that they are of length m. 
Certainly they are prefixes of two suffixes of X. When represented via the prev
function, these two suffixes must have equal prefixes of length at least m (by
Lemma 2). Therefore, the two words must share a path in the p-suffix tree. Consider again
Y = uuuvvv. Notice that prev(uuuvvv) = 011011 and that prev(vvv) = 011, so
uuu and vvv can share a path in the p-suffix tree.

For later use, we also need to define a lexicographic order relation on the prev
representation of strings. It reduces to the usual definition when the string has
no parameters. Consider the alphabet Σ_2 and let <_2 denote the standard
lexicographic order relation for strings over a fixed alphabet: the subscript
indicates the alphabet to which the relation refers.

Definition 3. Let X and Y be two strings. We say that X is lexicographically
smaller than Y if and only if prev(X) <_2 prev(Y). We indicate such a relation
via <_2.

2.2 Matching via Functions 

In what follows, we need another type of relation that, for now, we define as a
Table. A Table T has domain Σ_1 and ranges over the power set of Σ_1.

Definition 4. Given two tables T and T' and two strings X and Y of length n,
we say that X table matches Y via the two tables T and T', or, for short, that
X and Y t-match, if and only if y_i ∈ T(x_i) and x_i ∈ T'(y_i), for all 1 ≤ i ≤ n.
For instance, let T(a) = {a, u}, T(b) = {x, v, y}, T'(a) = T'(u) = {a} and
T'(x) = T'(v) = T'(y) = {b}. Then X = aaabbb and Y = auaxvy t-match.
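
The definition of t-match translates into a position-by-position test. The sketch below is an illustration that represents a table as a Python dictionary from symbols to sets; the function name `t_match` is hypothetical, and the tables used in the usage note assume that the example reads T(a) = {a, u} and T(b) = {x, v, y}.

```python
def t_match(x, y, table, table_prime):
    """Return True if y_i is in T(x_i) and x_i is in T'(y_i) for all i."""
    if len(x) != len(y):
        return False
    return all(
        b in table.get(a, set()) and a in table_prime.get(b, set())
        for a, b in zip(x, y)
    )
```

With `T = {"a": {"a", "u"}, "b": {"x", "v", "y"}}` and `T_prime = {"a": {"a"}, "u": {"a"}, "x": {"b"}, "v": {"b"}, "y": {"b"}}`, the call `t_match("aaabbb", "auaxvy", T, T_prime)` succeeds, while `t_match("aaabbb", "aaabbb", T, T_prime)` fails, which shows that t-matching need not be reflexive.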

A first difference between table and parameterized matches is that in the 
first case, all symbols in Σ_1 are treated as parameters and the correspondence
is fixed once and for all. 

A more substantial difference between table and parameterized matches is 
that tables may not be functions (as in the example above). For arbitrary ta- 
bles, t-matching is also in general not an equivalence relation. Indeed, although 
symmetry is implied by the definition, neither reflexivity nor transitivity are. 
Notice also that t-matching incorporates the notion of match with don’t care. 
In this latter case, both tables assign to each symbol the don’t care symbol. We 
call this table the don’t care table. 

3 Functions and Block Motifs 

We now investigate the notion of block motif, which was termed repeat with a 
block of don’t cares in [8], in conjunction with that of parameterized and table 
match. 

Let 𝒯 be a family of tables and k an integer, with 0 < k < n, where n is the
length of a string X. Consider also a family of renaming functions.

Definition 5. Let Y be a factor of X. Y is a general k-repeat if and only if
the following conditions hold: (a) Y can be written as VQW, with V and W both
non-empty and |Q| = k; (b) there exists another word Z of X, two renaming
functions F and G and two tables in 𝒯, such that Z = F(V)Q'G(W) and Q and
Q' t-match, via the two tables.

Definition 6. Let R(k,i,j) be the following binary relation on strings of length
m, with 1 < i ≤ j + 1, j < m and k = j − i + 1: Z R(k,i,j) Y if and only if
(z_1 z_2 ... z_{i-1}), (z_{j+1} ... z_m) and (y_1 y_2 ... y_{i-1}), (y_{j+1} ... y_m) match, respectively,
while (z_i ... z_j) and (y_i ... y_j) t-match via two not necessarily distinct tables in 𝒯.

We now give a formal definition of motif. Intuitively, it is a representative 
string that describes multiple occurrences of “equivalent” strings. 

Definition 7. Given a string X, consider a factor Y of X, of length m, and
assume that it is a general k-repeat. Let i and j be as in Definition 6 and con-
sider all factors Z of X such that Y R(k,i,j) Z. Assume that R(k,i,j) is an
equivalence relation. Then, for each class with at least two elements, a block mo-
tif is any arbitrarily chosen word in that class, say Y. As for standard strings,
the block motif can be written as y_1 y_2 ... y_{i-1} ◊^k y_{j+1} ... y_m, once it is
understood that ◊^k is a place holder specifying a central part of the motif and that the
matching criterion for that part is given by the family of tables.

For instance, restrict the family of tables to be the don’t care table only. Let
Z = abvvva and Y = abxxya; then we have Z R(2,3,4) Y with the identity
function for the prefix ab and G(v) = y and G(a) = a for the suffix of length
2. Moreover, consider X = YZ. Then, ab ◊^2 va is a block motif. Also ab ◊^2 ya
is a block motif, but it is equivalent to the other one, given the choices made
about the family of tables and the fact that we are using a notion of match via
renaming.

We now investigate the types of table families that allow us to properly define 
block motifs. As it should be clear from the example discussed earlier, the notion 
of block motifs, as defined in [8], is a special case of the ones defined here. It is 
also clear that the family of all tables yields the same notion of block motif as the 
one with the don’t care table only. However, it can be shown that exclusion of 
the don’t care table is not enough to obtain a proper definition of block motifs. 
Fortunately, there are easily checkable sufficient conditions ensuring that the 
family of tables guarantees R to be an equivalence relation, as we outline next. 

Definition 8. Consider two tables T and T'. Let their composition, denoted by
∘, be defined by T ∘ T'(a) = ∪_{c ∈ T'(a)} T(c), for each symbol a in the alphabet.

The family 𝒯 is closed under composition if and only if, for any two tables in
the family, their composition is a table in the family.

Definition 9. A table T contains a table T' if and only if T'(a) ⊆ T(a), for
each symbol a in the alphabet.

Lemma 10. Assume that 𝒯 is closed under composition and that there exists a
table in 𝒯 containing the identity table. Then R is an equivalence relation.
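
The two sufficient conditions of Lemma 10 can be tested mechanically on a finite family. The following Python sketch is an illustration under stated assumptions: a table is a dictionary from symbols to frozensets, composition follows the reading (T ∘ T')(a) = union of T(c) over c in T'(a), and all function names are hypothetical.

```python
def compose(t, t_prime, alphabet):
    """(T o T')(a) = union of T(c) over all c in T'(a)."""
    return {
        a: frozenset().union(*(t[c] for c in t_prime[a]))
        for a in alphabet
    }

def contains(t, t_prime, alphabet):
    """T contains T' if T'(a) is a subset of T(a) for every symbol a."""
    return all(t_prime[a] <= t[a] for a in alphabet)

def satisfies_lemma_10(family, alphabet):
    """Check closure under composition and containment of the identity."""
    identity = {a: frozenset(a) for a in alphabet}
    closed = all(
        any(compose(s, t, alphabet) == f for f in family)
        for s in family for t in family
    )
    has_identity = any(contains(t, identity, alphabet) for t in family)
    return closed and has_identity
```

For example, the family consisting of a single table that maps both a and b to {a, b} and c to {c} passes the check, whereas the family consisting only of the table that swaps a and b does not, since composing the swap with itself gives the identity, which is not in the family.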

We now consider some interesting special classes of table functions, in par-
ticular four of them, for which we can define block motifs. Let the first family
consist only of the don’t care table, and let the second family consist of renaming
functions. Let 𝒯_m consist of many-to-one functions. In order to define the fourth
family, we need some remarks.

The use of tables for the middle part of a block motif allows us to specify
simple substitution rules a bit more relaxed than renaming functions. We discuss
one of them. Let us partition the alphabet into classes and let P denote the
corresponding partition. We then define a partition table T_P that assigns to each
symbol the class it belongs to. For instance, fix two characters in the alphabet,
say a and b. Consider the table, denoted for short T_{a,b}, that assigns {a,b} to
both a and b and the symbol itself to the remaining characters. In a sense, T_P
formalizes the notion of groups of characters being interchangeable, or equivalent.
Such situations arise in practice (see for instance [6,11,13,14,19,21,22]), in
particular in the study of protein folding.

Let the fourth family of tables consist of only T_P, for some given partition P
of the alphabet Σ_1.

Lemma 11. Pick any one of the four families just defined and consider the
relation R in Definition 6 for the chosen family. R is an equivalence relation.
In particular, when the chosen family is 𝒯_m, R is the same relation as that for
the family of renaming functions. Therefore,
for all those tables one can properly define block motifs.
Let the family of tables be one-to-one functions. Consider X = YZ, where
Y = abxxya and Z = abvvva. Then, ab ◊^2 va and ab ◊^2 ya are block motifs
representing the same class, the one consisting of Y and Z. We can pick any
one of the two, since they are equivalent. Notice that the rule for the central
part states that the corresponding regions of two strings described by the motifs
must each be a renaming of the other.

Let the family of tables be T_{a,b}, defined earlier. Let Z = cdccdacdc and
Y = cdccdbcdc. Let X = ZY. Then cdc ◊^3 cdc is a block motif, representing
both Y and Z. Again, the rule for the central part states that the corresponding
region for two strings described by the motif must be identical, except that a
and b can be treated as the same character.
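
The relation R of Definition 6, specialised to a partition table, can be checked with a few lines of Python. This is a sketch under assumptions: the partition is given as a dictionary `rep` mapping each symbol to the representative of its class (symbols absent from `rep` represent themselves), positions i and j are 1-based as in the text, and the helper names are hypothetical.

```python
def prev(s, constants):
    """prev encoding: parameters become distances to their previous use."""
    last_seen, encoded = {}, []
    for i, ch in enumerate(s):
        if ch in constants:
            encoded.append(ch)
        else:
            encoded.append(i - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = i
    return encoded

def reduce_string(s, rep):
    """Replace every symbol by the representative of its class."""
    return "".join(rep.get(c, c) for c in s)

def related(z, y, i, j, constants, rep):
    """Z R(k,i,j) Y for a partition table: the parts outside positions
    i..j must p-match, the central blocks must agree class by class."""
    if len(z) != len(y):
        return False
    left = prev(z[:i - 1], constants) == prev(y[:i - 1], constants)
    right = prev(z[j:], constants) == prev(y[j:], constants)
    centre = reduce_string(z[i - 1:j], rep) == reduce_string(y[i - 1:j], rep)
    return left and right and centre
```

With `rep = {"b": "a"}`, which plays the role of T_{a,b} with a as representative, `related("cdccdacdc", "cdccdbcdc", 4, 6, "abcd", rep)` holds, while with an empty `rep` it does not, because the central blocks cda and cdb then differ.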

4 Longest Block Motifs with a Fixed Partition Table 

We now give an algorithm that finds all longest block motifs in a string, when we 
use a partition table, known and fixed once and for all. The algorithm is a non- 
trivial generalization of the one introduced in [8]. In fact, we show that the main 
techniques used there, and that we nickname as the two-tree trick, represent a 
powerful tool to extract longest block motifs in various settings, when used in 
conjunction with the algorithmic ideas presented in this section. 

Indeed, a verbatim application of the two-tree trick would work on the p- 
suffix trees for the string and its reverse. Unfortunately, that turns out to be 
not enough in our setting. We need to construct a tree somewhat different from 
a p-suffix tree, which we refer to as a p-suffix tree on a mixed alphabet. Using 
this latter tree, the techniques in [8] can be extended. Moreover, due to the
generality of the algorithm constructing this novel version of the p-suffix tree,
all the techniques we discuss in this section extend to the other three types of
block motifs defined in Section 3, as is briefly outlined in Section 5.

For each class in P, select a representative. The representatives give a reduced
alphabet Σ_3. For any string Y, let Ȳ be its corresponding string on the new
alphabet, obtained by replacing each symbol in Y with its representative. In
what follows, for our examples, we choose T_{a,b}, with a as representative. Consider
a string X and assume that it has block motif V ◊^k W, with respect to table
T_P. We recall that V ◊^k W is a shorthand notation for the fact that strings in
the class (a) t-match in the positions corresponding to the central part and (b)
(parameterized) match in the positions corresponding to V and W. We are
interested in finding all longest block motifs.

Consider a lexicographic tree T, storing a set of strings. Let Y be a string.
The locus u of Y in T, if it exists, is the node such that Y matches the string
corresponding to the path from the root of T to u. Notice that when T is a
p-suffix tree, then prev(Y) must be the string on the path from the root to u.
For standard strings, the definition of locus reduces to the usual one. With those 
differences in mind, one can also define in the usual way the notion of contracted 
and extended locus of a string. Moreover, given a node u, let d{u) be the length 
of the string of which u is locus. 
4.1 A p-Suffix Tree on a Mixed Alphabet

Definition 12. The modified prev representation of a string Y, mprev(Y), is
defined as follows. If |Y| < k, then it is Ȳ, i.e., Y written over the reduced
alphabet. Else, it is W̄ prev(Z), where Y = WZ and |W| = k.

For instance, let Y = abauuxx, and k = 3. Then, its modified prev represen-
tation is mprev(Y) = aaa0101.

Definition 13. Let X be a string with a unique endmarker. Let T'_X be a lex-
icographic tree storing each suffix of X in lexicographic order, via its mprev
representation. That is, T'_X is like a p-suffix tree, but the initial part of each
suffix is represented on the reduced alphabet.

For instance, let X = abbabbb and k = 2; then the first suffix of X is stored as
aababbb.
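
The modified prev representation of Definition 12 combines the reduced alphabet with the prev encoding. The following Python sketch is illustrative only: the partition is passed as a dictionary `rep` from symbols to class representatives, every symbol outside `constants` is treated as a parameter, and the result is a list so that large distances stay single symbols. These conventions are assumptions made here.

```python
def prev(s, constants):
    """prev encoding: parameters become distances to their previous use."""
    last_seen, encoded = {}, []
    for i, ch in enumerate(s):
        if ch in constants:
            encoded.append(ch)
        else:
            encoded.append(i - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = i
    return encoded

def mprev(s, k, constants, rep):
    """First k symbols over the reduced alphabet, the rest via prev.
    When len(s) < k the whole string is reduced and the tail is empty."""
    head = [rep.get(c, c) for c in s[:k]]
    return head + prev(s[k:], constants)
```

With T_{a,b} and a as representative, that is `rep = {"b": "a"}`, the call `mprev("abauuxx", 3, "ab", rep)` gives a, a, a, 0, 1, 0, 1 and `mprev("abbabbb", 2, "ab", rep)` gives aababbb, in agreement with the two examples in the text.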

Notice that T'_X has O(n) nodes, since it has n leaves and each node has
outdegree at least two. We anticipate that we only need to build and use the
topology of T'_X, since we do not use it for pattern matching and indexing, as
is customary for those data structures.

We now show how to build T'_X in O(n log n) time. Let BuildTree be a pro-
cedure that takes as input the n suffixes of X and returns as output T'_X. The
only primitive that the procedure needs to use is the check, in constant time, for
the lexicographic order of two suffixes, according to a new order relation that
we define. The check should also return the longest prefix the two suffixes have
in common, and which suffix is smaller than the other.

Definition 14. Let Y and Z be two strings. Let <_3 be a lexicographic order
relation over Σ_3. We define a new order relation Y <_m Z as follows, where a bar
denotes a string written over the reduced alphabet. When
|Y| < k, it must be Ȳ <_3 Ū, where U is a prefix of Z and |U| = |Y|. Assume
that |Y| ≥ k, and let Z = US and Y = RP, with |R| = |U| = k. Then, it must
be R̄ <_3 Ū, or R̄ = Ū but prev(P) <_2 prev(S). Abusing notation, we can write
that mprev(Y) <_m mprev(Z), when Y <_m Z.

Let T be a tree and consider two nodes u and v. Let LCA(u, v) denote the
lowest common ancestor of u and v. Given the suffix tree [18] of X over the
reduced alphabet and the p-suffix tree T_X of X, assume that they have been
processed to answer LCA queries in constant time [10,20]. Then, it is easy to
check, in constant time, the <_m order
of two suffixes of X, via two LCA queries in those trees. Moreover, that also
gives us the length of the matching prefix. The details are omitted. We refer to
such an operation as compare(i, j), where i and j are the suffix positions. It
returns which one is smaller and the length of their common prefix.

Now, BuildTree works as follows. It simply builds the tree, without any 
labelling of the edges, as it is usual in lexicographic trees. 
ALGORITHM BuildTree

1. Using compare and the <_m relation, sort the suffixes of X with, say, Heapsort
[7].

2. Process the sorted list i_1, ..., i_n of suffixes in increasing order as follows:

2.1 When the first suffix is processed, create a root and a leaf, push them on
a stack in the order they are created. Label the leaf with i_1.

2.2 Assume that we have processed the list up to i_g and that we are now
processing i_{g+1}. Assume that on the stack we have the path from the
root to the leaf labeled i_g in the tree built so far, from bottom to top. Let it
be u_1, u_2, ..., u_s.

2.2.1 Using compare and the <_m relation, find the longest prefix that i_g
and i_{g+1} have in common. Let Z denote that prefix and d be its
length.

2.2.2 Pop elements from the stack until one finds two such that d(u_i) ≤
d < d(u_{i+1}). Pop u_{i+1} from the stack. If d(u_i) = d, then u_i is the
locus of Z in the tree built so far. Else, u_i and u_{i+1} are its contracted
and extended locus, respectively. If u_i is the locus of Z, add a new
leaf labeled i_{g+1} as offspring of u_i and push it on the stack. Else,
create a new internal node u, as locus of Z, add it as offspring of u_i
and make u_{i+1} an offspring of u. Moreover, add a new leaf labeled
i_{g+1} as offspring of u and push the newly created nodes on the stack,
in the order in which they were created. We now have on the stack
the path from the root to the leaf labeled i_{g+1}.

Lemma 15. Tree T'_X can be correctly built in O(n log n) time.
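The stack discipline of step 2 can be illustrated with a small sketch. It is not the construction analysed above: it assumes plain lexicographic order instead of the <m relation, uses Python's built-in sort instead of Heapsort, and computes common prefixes naively rather than through constant-time LCA queries, so it does not meet the O(n log n) bound. The names build_tree and leaves, and the dictionary layout of the nodes, are hypothetical.

```python
def build_tree(text):
    """Compacted trie of the suffixes of text (text must end with a unique endmarker)."""
    n = len(text)
    # Stand-in for step 1: sort suffix positions (plain lexicographic order assumed).
    order = sorted(range(n), key=lambda i: text[i:])

    def lcp(i, j):
        # Naive length of the longest common prefix of two suffixes.
        d = 0
        while i + d < n and j + d < n and text[i + d] == text[j + d]:
            d += 1
        return d

    root = {"depth": 0, "children": [], "leaf": None}
    first = {"depth": n - order[0], "children": [], "leaf": order[0]}
    root["children"].append(first)
    stack = [root, first]          # path from the root to the last leaf
    for prev, cur in zip(order, order[1:]):
        d = lcp(prev, cur)
        last = None
        while stack[-1]["depth"] > d:
            last = stack.pop()     # nodes deeper than the common prefix
        top = stack[-1]
        if top["depth"] < d:
            # No locus for the common prefix yet: split the edge top -> last.
            node = {"depth": d, "children": [last], "leaf": None}
            top["children"][-1] = node   # last was the most recent child of top
            stack.append(node)
            top = node
        leaf = {"depth": n - cur, "children": [], "leaf": cur}
        top["children"].append(leaf)
        stack.append(leaf)
    return root


def leaves(node):
    """Leaf labels in left-to-right order."""
    if node["leaf"] is not None:
        return [node["leaf"]]
    return [x for child in node["children"] for x in leaves(child)]
```

Because the endmarker is unique, the common prefix of two consecutive suffixes is always shorter than the previous leaf, so at least one node is popped before a split is attempted.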

4.2 The Algorithm 

Consider the trees T_{\bar X} and T'_X, where the latter one is a p-suffix tree. For each
leaf labeled i in T_{\bar X}, change its label to be n + 2 − i, so that whenever the left
part of a block motif starts at i in \bar X, we have the position in X where the right
part starts, including the central part. We refer to those positions as twins. Visit
T'_X in preorder. Consider the two leaves ℓ_1 ∈ T'_X and ℓ_2 ∈ T_{\bar X}, corresponding to
a pair of twins. Assign to ℓ_2 the same preorder number as that of ℓ_1. Let V ◇ W
be a block motif and let i be one of its occurrences in X, i.e., where it starts. In
order to simplify our notation, we refer to such an occurrence via the preorder
number of the leaf assigned to i + |V| + 1 in T'_X. From now on, we shall simply be
working with those preorder numbers. Indeed, given the tree we are in, we can
recover the positions in X or \bar X corresponding to the label at a leaf in constant
time, by suitably keeping a set of tables. The details are as in [8]. Moreover, we
can also recover the position where a block motif occurs, given the block motif
and the preorder number assigned to the position. Given a tree T, let L(v) be
the list of labels assigned to the leaves in the subtree rooted at v. For the trees
we are working with, those would be preorder numbers.

Definition 16. We say that V ◇ W is maximal if and only if extending any
word in the class, both to the left and to the right, results in the loss of at least
one element in the class. That is, by extending the strings in the class, we can
possibly get a new block motif, but its class does not contain that of V ◇ W.

For instance, let X = aabbaxxbxababyyayabbbuu. Block motif ab ◇ xx is maximal.
Indeed, it represents the class of words {abbaxx, ababyy, abbbuu}. However,
extending any of those words either to the right or to the left results in a smaller
class.

Lemma 17. Consider a string X, its reverse \bar X, the trees T_{\bar X} and T'_X. Assume
that V ◇ W is maximal. Pick any representative in the class, say VQW. Then
V and mprev(QW) have a locus u in T_{\bar X} and v in T'_X, respectively. Moreover,
all the occurrences of V ◇ W are in L(u) ∩ L(v). Conversely, pick two nodes u'
and v', in T_{\bar X} and T'_X, respectively. Assume that there are at least two labels i
and j in L(u') ∩ L(v') such that LCA(i, j) = u' and LCA(i, j) = v', in T_{\bar X} and
T'_X, respectively. Assume also that d(v') > k. Then, they are occurrences of a
maximal block motif.

We also need the following: 

Lemma 18. Consider an internal node v in T_{\bar X} and two of its offsprings, say,
v_1 and v_2. Let j_1, j_2, ..., j_m be the sorted list of labels assigned to the leaves in
the subtree rooted at v_1 and let i be a label assigned to any leaf in v_2. Let g be the
first index such that j_g < i. Similarly, let c be the first index such that i < j_c. The
maximal block motif of maximum length that i forms with j_1, j_2, ..., j_m is either
with j_g, if it exists, or with j_c, if it exists, provided that either d(LCA(i, j_g)) > k
or d(LCA(i, j_c)) > k and the LCA is computed on T'_X.

We now present the algorithm. 

ALGORITHM LM

1. Build T_{\bar X} and T'_X. Visit T'_X in preorder and establish a correspondence
between the preorder numbers of the leaves in T'_X and the leaves in T_{\bar X}.
Transform T_{\bar X} into a binary suffix tree B (see [8]);

2. Visit B bottom up and, at each node, merge the sorted lists of the labels
(preorder numbers in T'_X) associated to the leaves in the subtrees rooted at
the children. Let these lists be A_1 and A_2 and assume that |A_1| ≤ |A_2|.
Merge A_1 into A_2. Any time an element i of the first list is inserted in the
proper place in the other, i.e., j_g and j_c in Lemma 18 are identified, we
only need to check for two possibly new longest maximal block motifs that
i can generate. While processing the nodes in the tree, we keep track of the
longest maximal block motifs found.
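The merging step can be sketched as follows, under explicit assumptions: labels are distinct integers, j_g and j_c are read as the closest labels of the larger list below and above i, and plain Python lists are used, so the sketch does not reproduce the data structure of [8] that yields the O(n log n) bound. The function name merge_with_neighbours is hypothetical.

```python
import bisect
import heapq


def merge_with_neighbours(a1, a2):
    """Merge two sorted label lists, reporting neighbours of the smaller list's elements.

    Returns (merged, candidates), where candidates holds one triple
    (i, j_g, j_c) per element i of the smaller list: j_g is the closest
    label below i in the larger list and j_c the closest label above i
    (None when no such label exists).
    """
    small, large = (a1, a2) if len(a1) <= len(a2) else (a2, a1)
    candidates = []
    for i in small:
        pos = bisect.bisect_left(large, i)
        j_g = large[pos - 1] if pos > 0 else None
        j_c = large[pos] if pos < len(large) else None
        candidates.append((i, j_g, j_c))
    merged = list(heapq.merge(small, large))
    return merged, candidates
```

Each triple gives the at most two pairs (i, j_g) and (i, j_c) that would then be tested with an LCA query on T'_X.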

Theorem 19. ALGORITHM LM correctly identifies all longest block motifs in a
string X, when the matching rule for the central part is given by a partition
table. It can be implemented to run in O(n log n) time.

Proof. The proof of correctness comes from Lemma 18. The details of the
analysis are as in [8] with the addition that we need to build both T_{\bar X} and T'_X, which
can be done in O(n log n) time ([3, 18] and Lemma 15). □

5 Extensions 

In this Section we show how to specialize the algorithm in Section 4 when the 
central part is specified by %. All we need to do is to define a lexicographic 
order relation, analogous to the one in Definition 14. In turn, that will enable
us to define a variant of the tree T'_X, which can still be built in O(n log n)
time with Algorithm BuildTree and used in Algorithm LM to identify block
motifs with the don't care symbol. We limit ourselves to defining the new tree. An
analogous reasoning will yield algorithms dealing with a central part defined by
either renaming functions or by regular expressions with no Kleene Closure or
Complement operation. The details are omitted. For the new objects we define,
we keep the same notation as for their analogues in Section 4.

Let * be a symbol not belonging to the alphabet and not matching any other
symbol of the alphabet. Consider Definition 12 and change it as follows:

Definition 20. The modified prev representation of a string Y, mprev(Y), is
defined as follows. If |Y| = m < k, then it is *^m. Else, it is *^k prev(Z), where
Y = WZ and |W| = k.

For instance, let Y = abauuxx, and k = 3. Then, its modified prev
representation is mprev(Y) = * * * 0101.
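The example can be checked with a short sketch. Definition 12 is not reproduced here, so the sketch assumes that prev replaces the first occurrence of each symbol by 0 and every later occurrence by the distance to the previous occurrence, which agrees with the example above. It also assumes that a string shorter than k is represented by one * per symbol. The function names are hypothetical.

```python
def prev_encoding(s):
    """Assumed prev encoding: 0 for a first occurrence, else distance to the previous one."""
    last_seen = {}
    encoded = []
    for pos, symbol in enumerate(s):
        encoded.append(pos - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = pos
    return encoded


def mprev(y, k):
    """Modified prev representation: k don't-care symbols, then prev of the rest."""
    if len(y) < k:
        return ["*"] * len(y)
    return ["*"] * k + prev_encoding(y[k:])
```

The result is returned as a list rather than a string so that distances of ten or more remain unambiguous.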

We now define another lexicographic tree, still denoted by T'_X. Consider
Definition 13 and change it as follows:

Definition 21. Let X be a string with a unique endmarker. Let T'_X be a
lexicographic tree storing each suffix of X, via their mprev representation according
to Definition 20. That is, T'_X is like a p-suffix tree, but the initial part of each
suffix is represented with *'s.

For instance, let X = abbabbb and k = 2; the first suffix of X is stored as
* * babbb.

Finally, consider Definition 14 and change it as follows: 

Definition 22. Let Y and Z be two strings. We define a new order relation
Y <m Z as follows. When |Y| < k, it must be |Y| < |Z|. Assume that |Y| ≥ k,
and let Z = US and Y = RP, with |R| = |U| = k. Then, it must be prev(P) <2
prev(S). With a little abuse of notation, we can write mprev(Y) <m mprev(Z).

Observe that Algorithm BuildTree will work correctly with this new defi- 
nition of lexicographic order, except that now, in order to compare suffixes, we 
need only the p-suffix tree T^. Finally, the results in Section 4.2 hold verbatim: 

Theorem 23. ALGORITHM LM correctly identifies all longest block motifs in a
string X, when the matching rule for the central part is given by the don't care
table. It can be implemented to run in O(n log n) time.

Abstract. Evolution acts in several ways on biological sequences: either
by mutating an element, or by inserting, deleting or copying a segment
of the sequence. Varré et al. [12] defined a transformation distance for
the sequences, in which the evolutionary operations are copy, reverse
copy and insertion of a segment. They also proposed an algorithm to
calculate the transformation distance. This algorithm is O(n^4) in time
and O(n^4) in space, where n is the size of the sequences. In this paper,
we propose an improved algorithm which costs O(n^2) in time and O(n)
in space. Furthermore, we extend the operation set by adding deletions.
We present an algorithm which is O(n^3) in time and O(n) in space for
this more general model.



1 Introduction 

Building models and tools to quantify evolution is an important domain of
biology. Evolutionary trees or diagrams are based on statistical methods which
exploit comparison methods between genomic sequences. Many comparison models
have been proposed according to the type of physico-chemical phenomena that
underlie the evolutionary process [5]. Different evolutionary operation sets have
been studied. Mutation, deletion and insertion were the first operations dealt with
[7]. Duplication and contraction were then added to the operation set [2, 1]. All
these operations act on single letters, representing bases, amino acids or
more complex sequences: they are called point transformations. Segment
operations are also very important to study. In a number of papers [13, 12, 11], Varré et
al. have studied an evolutionary distance based on the amount of segment moves
that Nature needed (or is supposed to have needed) to transfer a sequence from
one species to the equivalent sequence in another one. Their model is concerned
with segment copy, with or without reversal, and with segment insertion: it is
thus a very simple and robust model which can easily be explained from
biological mechanisms (similar or simpler models had been previously discussed by
Schöniger and Waterman [8] and Morgenstern et al. [6]). They developed this
study on DNA sequences, but the basic concepts and algorithms apply as well
to other biological sequences like proteins or satellites.

The algorithm they propose to compute the minimal transformation sequence
is based on an encoding into a graph formalism, from which one can get the
solution by computing shortest paths. This gives an O(n^4) answer both in space
and time. In fact it is possible to give a direct solution based on dynamic
programming which costs only O(n^2) in time and O(n) in space¹. This solution
is obviously more efficient for long sequences and makes the problem tractable
even for very long sequences.

In the second section we describe the model and the problem.

In the third section our algorithm for calculating the transformation distance
is presented. It is based on dynamic programming.

In section 4, we introduce deletions in our model and we give an algorithm
to solve the extended transformation distance problem in the presence of deletions:
this algorithm runs in time O(n^3) and space O(n).

In section 5, we use biological sequences to justify the concept of the
extended transformation distance.

Finally, section 6 is dedicated to conclusions and remarks.



2 Model and Problem Description 

The symbols are elements from an alphabet Σ. The set of all finite-length strings
formed using symbols from alphabet Σ is denoted by Σ*. In this paper, we use
the letters x, y, z, ... for the symbols in Σ and S, T, P, R, ... for strings in Σ*.
The empty string is denoted by ε. The length of a string S is denoted by |S|.
The concatenation of two strings P and R, denoted PR, has length |P| + |R| and
consists of the symbols from P followed by the symbols from R.

We will denote by S[i] the symbol in position i of the string S (the first symbol
of a string S is S[1]). The substring of S starting at position i and ending at
position j is denoted by S[i..j] = S[i]S[i+1]...S[j]. The reverse of a string S is
denoted by S^{-1}. Thus, if n is the length of S, S^{-1}[i..j] = (S[(n−j+1)..(n−i+1)])^{-1}
and S[i..j]^{-1} = S^{-1}[(n−j+1)..(n−i+1)]. R is a substring of S if and only
if R^{-1} is a substring of S^{-1}. We say that a string P is a prefix of a string S,
denoted P ⊏ S, if S = PR for some string R ∈ Σ*. Similarly, we say that a
string P is a suffix of a string S, denoted by P ⊐ S, if S = RP for some R ∈ Σ*.
Note that P is a prefix of S if and only if P^{-1} is a suffix of S^{-1}. For brevity of
notation, we denote the k-symbol prefix P[1..k] of a string pattern P[1..m] by P_k.
Thus, P_0 = ε and P_m = P = P[1..m]. We recall the definition of a subsequence:
Given a string S[1..n], another string R[1..k] is a subsequence of S, denoted by
R ≺ S, if there exists a strictly increasing sequence <i_1, i_2, ..., i_k> of indices
of S such that for all j = 1, 2, ..., k, we have S[i_j] = R[j]. For example, if
S = xxyzyyzx, R = zzxx and P = xxzz, then P is a subsequence of S, while
R is not a subsequence of S. When a string S is a subsequence of a string T,
T is called a supersequence of S, denoted by T ≻ S. In the last example, S is a
supersequence of P.
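The notation above can be made concrete with a few helpers. This is only an illustrative sketch: the function names are hypothetical, and the 1-based, inclusive indexing of the text is mapped onto Python's 0-based slices.

```python
def substring(s, i, j):
    """S[i..j] with 1-based, inclusive positions as in the text."""
    return s[i - 1:j]


def reverse(s):
    """The reverse string S^{-1}."""
    return s[::-1]


def is_subsequence(r, s):
    """True if r can be obtained from s by deleting zero or more symbols."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the matched positions are strictly increasing.
    return all(symbol in remaining for symbol in r)
```

With S = xxyzyyzx, the call is_subsequence("xxzz", S) succeeds while is_subsequence("zzxx", S) fails, as in the example of the text.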

Varré et al. [12, 11] propose a new measure which evaluates segment-based
dissimilarity between two strings: the source string S and the target string T.
This measure is related to the process of constructing the target string T with
segment operations². The construction starts with the empty string ε and
proceeds from left to right by adding segments (concatenation), one segment per
operation. The left-to-right generation is not a restriction if the costs of
operations are independent of the time (which is the case in this problem). A list of
operations is called a script. Three types of segment operations are considered:
the copy adds segments that are contained in the source string S, the reverse
copy adds segments that are contained in S in reverse order, and the
insertion adds segments that are not necessarily contained in S. The measure depends
on a parameter, the Minimum Factor Length (MFL); it is the minimum
length of the segments that can be copied or reverse copied.

¹ In this paper, n is the maximum size of the sequences.

Depending on the number of common segments between S and T, there exist
several scripts for constructing the target T. Among these scripts, some are more
likely; in order to identify them, we introduce a cost function for each operation.
InsertCost(T[i..j]) is the cost of insertion of substring T[i..j]. CopyCost(T[i..j])
is the cost of copying the segment T[i..j] from S if it is contained in S. Finally
RevCopyCost(T[i..j]) is the cost of copying substring T[i..j] from S if the reverse
of this substring is contained in the source S (which means this string is contained
in S^{-1}). The cost of a script is the sum of the costs of its operations. The minimal
scripts are all scripts of minimum cost and the transformation distance³ (TD)
is the cost of a minimal script. The problem which we solve in this paper is the
computation of the transformation distance. It is clear that it is also possible to
get a minimal script.

3 Algorithm 

In this section we describe the algorithm to determine the transformation
distance between two strings. As the scripts construct the target string T from left
to right by adding segments, dynamic programming is an ideal tool for
computing the transformation distance. Each added segment is the result of a copy,
a reverse copy or an insertion. Algorithm 1 determines the transformation
distance between S and T by dynamic programming (figure 1). Let
C[k] be the minimum production cost of T[1..k] using the segments of S. We
make use of generic functions CopyCost, RevCopyCost and InsertCost as defined
at the end of section 2. In order to fix ideas, one can consider that these costs
are proportional to the length of the searched segment (and ∞ if this segment
does not occur in S). In fact any sub-additive function would be suitable.

Deciding whether a given substring of T exists in S or not, and finding its
position if it is present, requires a string matching algorithm.
The design of the string matching part of Algorithm 1 is based on the KMP
(Knuth-Morris-Pratt) string matching algorithm with some changes. We need to recall
the definition of the prefix function π (adapted from the original KMP one), which
is computed in ComputePrefixFunction (called in line 7). Given a pattern
P[1..m], the prefix function for pattern P is the function π : {1, 2, ..., m} →
{0, 1, ..., m − 1} such that π[q] = max{k : k < q and P_k ⊐ P_q}. That is, π[q] is
the length of the longest prefix of P that is a proper suffix of P_q. We have the
following lemma for the prefix functions.

² In this paper we use segment as an equivalent word for substring.

³ Although this measure is not a mathematical distance, we will use the term
transformation distance which was introduced by Varré et al. [12, 11].

Lemma 1. The prefix function of P_k is the restriction of the prefix function of P to
the set {1, 2, ..., k}.

Proof: The proof is immediate by the definition of the prefix function because
π[i] for a given i can be obtained only from P_{i−1} = P[1..(i − 1)] and P[i]. □
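Lemma 1 can be checked on small inputs with the textbook computation of the prefix function. The sketch below follows the standard KMP formulation rather than any code from the paper; the name prefix_function is hypothetical, and the returned list carries an unused entry at index 0 so that pi[q] matches the 1-based π[q] of the text.

```python
def prefix_function(p):
    """Return pi with pi[q] = length of the longest prefix of p that is a
    proper suffix of p[:q], for q = 1..len(p); pi[0] is unused and set to 0."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # p[k] is P[k+1] and p[q-1] is P[q] in the 1-based notation.
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi
```

Since pi[q] is computed from p[:q] only, truncating the pattern truncates the table, which is the content of the lemma.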

Although simple, this lemma is a cornerstone of the algorithm. It shows
that one can search for the presence of the prefixes of a pattern string in the
source string at the same time as searching for the complete pattern, without
increasing the complexity of the search. Lines 8-13 of the algorithm determine
the existence of the prefixes of pattern P in S^{-1}. While S is scanned from right
to left (loop of line 9), q is the length of the longest prefix of P which is a suffix of
S^{-1}[1..(|S| − i + 1)] in line 14. Note that when we are searching for occurrences of
prefixes of P in S^{-1}, we are in fact searching for occurrences of suffixes of
T[1..k] in S.

The complexity of lines 8-14 is O(n) in time and space. Computation
of π needs O(n) in time and space (line 7). For the proof of the complexity and
correctness of lines 6-13, see chapter 34.4 of [3].



Algorithm 1 TransformationDistance(S, T)

 1. C[0] ← 0
 2. for k ← 1 to |T| do
 3.     C[k] ← ∞
 4.     for i ← 1 to k do
 5.         C[k] ← min{C[k], C[i − 1] + InsertCost(T[i..k])}
 6.     P ← T[1..k]^{-1}
 7.     ComputePrefixFunction(P, π)
 8.     q ← 0
 9.     for i ← |S| downto 1 do
10.         while q > 0 and P[q + 1] ≠ S[i] do
11.             q ← π[q]
12.         if q < |P| and P[q + 1] = S[i] then
13.             q ← q + 1
14.         C[k] ← min{C[k], C[k − q] + CopyCost(T[(k − q + 1)..k])}
15.     repeat lines 8..14 replacing S and CopyCost by S^{-1} and RevCopyCost respectively
16. return C[n]

Fig. 1. Transformation Distance: a dynamic programming solution
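A runnable reading of Figure 1 is sketched below under stated assumptions: the cost functions take only the segment length (in line with the remark that costs may be proportional to the length), the MFL is passed as a separate parameter instead of being encoded as an infinite cost, and q is reset through π after a full match so that P[q + 1] is never read past the end of P. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def prefix_function(p):
    """pi[q] = longest prefix of p that is a proper suffix of p[:q]; pi[0] unused."""
    pi = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def transformation_distance(s, t, insert_cost, copy_cost, rev_copy_cost, mfl=1):
    """Cost of a cheapest script building t from segments of s (costs depend on length)."""
    n = len(t)
    c = [0] + [math.inf] * n
    for k in range(1, n + 1):
        # Lines 4-5: the last operation is an insertion of t[i..k].
        for i in range(1, k + 1):
            c[k] = min(c[k], c[i - 1] + insert_cost(k - i + 1))
        # Lines 6-7: pattern is the reverse of the current target prefix.
        p = t[:k][::-1]
        pi = prefix_function(p)
        # Lines 8-15: scan s right to left (copy), then s reversed (reverse copy).
        for source, cost in ((s, copy_cost), (s[::-1], rev_copy_cost)):
            q = 0
            for symbol in reversed(source):
                if q == len(p):          # full match: fall back before extending
                    q = pi[q]
                while q > 0 and p[q] != symbol:
                    q = pi[q]
                if p[q] == symbol:
                    q += 1
                if q >= mfl:             # a suffix of t[:k] of length q occurs in source
                    c[k] = min(c[k], c[k - q] + cost(q))
    return c[n]
```

For instance, with s = "abc", an insertion cost of 2 per symbol, unit copy and reverse copy costs and MFL = 2, the target "abccbax" is built by one copy, one reverse copy and the insertion of x.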



Proposition 1 Algorithm 1 correctly determines the transformation distance of
S and T.

Proof: We prove by induction on k that after the algorithm execution, C[k]
contains the minimum production cost of target T[1..k] with the source string
S. C[0] is initialized to 0, because the cost of production of ε from S is zero.
Now, we suppose that C[i] is determined correctly for all i < k for some
positive value of k. Let us consider the calculation of C[k]. The last operation
in a minimal script which generates T[1..k] creates a suffix of T[1..k]. Let this
suffix be T[i..k] (see figure 2). As the script is minimal, the script without its
last operation is a minimal script for T[1..(i − 1)]. The minimum cost of the
script for T[1..(i − 1)] is C[i − 1] by induction hypothesis. If T[i..k] exists in S,
then q will be equal to k − i + 1 at some moment during the algorithm execution
in line 14 (|T[i..k]| = k − i + 1). If T[i..k] exists in S and the last operation
of the minimal script is a copy operation, the minimal cost of the script is
C[i − 1] + CopyCost(T[i..k]) (note that q = k − i + 1 amounts to i − 1 = k − q in
line 14). Similarly, if the last operation in the minimal script of T[1..k] is a reverse
copy operation, the minimal cost of the script is C[i − 1] + RevCopyCost(T[i..k])
(line 15). Finally, if the last operation in the minimal script of T[1..k] is an
insertion, the minimal cost of the script is C[i − 1] + InsertCost(T[i..k]) (lines
4-5). Thus, C[n] is the minimum cost of production of T = T[1..n] and the
algorithm determines correctly the transformation distance of S and T. □



[Figure: the last operation of a script for T[1..k] produces the suffix T[i..k] in one of three ways.
Copy: cost C[i − 1] + CopyCost(T[i..k]).
Reverse Copy: cost C[i − 1] + RevCopyCost(T[i..k]).
Insertion: cost C[i − 1] + InsertCost(T[i..k]).]

C[k] = min_i { C[i − 1] + min{ InsertCost(T[i..k]), CopyCost(T[i..k]), RevCopyCost(T[i..k]) } }

Fig. 2. The three different possibilities for generation of a suffix of T[1..k]



Note that when the length of the substring T[i..k] is smaller than MFL,
CopyCost(T[i..k]) and RevCopyCost(T[i..k]) are equal to ∞.

The complexity of lines 6-13 is O(n) in time and space. So the whole algorithm
for calculation of transformation distance costs O(n^2) in time and O(n) in space.

4 An Additional Operation: Deletions

In this section, we extend the set of evolutionary operations by adding the
deletion operation. During a deletion operation, one or more symbols of the string
which is under evolution are eliminated. This is an important operation from the
biological point of view; in the real evolution of biological sequences, in several
cases after or during the copy operations some bases (symbols) are eliminated.
We define a function DelCost for the cost of deletions; DelCost(x) is the cost
of deletion of a symbol x. For simplicity, we suppose that the deletion cost of a
segment (substring) is equal to the sum of deletion costs of its symbols. Thus
we have DelCost(P[1..k]) = Σ_{i=1}^{k} DelCost(P[i]).

As before, our objective is to find the minimum cost for a script generating 
a target string T, with the help of segments of a source string S. As the costs 
are independent of time and deletion cost of a segment is the sum of deletion 
costs of its symbols, we can consider that the deletions are applied only in the 
latest added segment (rightmost one), at any moment during the evolution. It 
should be clear that in an optimal transformation, deletions are not applied into 
an inserted substring (a substring which is the result of an insertion operation) . 
Depending on the assigned costs, deletions can be used after the copy or re- 
verse copy operations. We consider a copy operation together with all deletions 
which are applied to that copied segment as a unit operation. So we have a new 
operation called NewCopy which is a copy operation followed by zero or more 
deletions on the copied segment. In figure 3 a schema of a NewCopy operation is 
illustrated. Similarly, NewRevCopy is a reverse copy operation followed by zero 
or more deletions. 




[Figure: NewCopy(T[i..k]) shown as the copy of a segment of S followed by the deletion of some of its sub-segments (the deleted segments).]

Fig. 3. Illustration of the NewCopy operation: a copy operation + zero or more deletions



Solving the extended transformation distance with deletions amounts
to solving the transformation distance with the following 3 operations:
Insertion, NewCopy and NewRevCopy. A substring T[i..j] of the target string can
be produced by a single NewCopy operation if and only if T[i..j] is a
subsequence of the source S. Similarly, T[i..j] can be produced by a single
NewRevCopy operation if and only if T[i..j]^{-1} is a subsequence of the
source S. During the algorithm, we will need the minimum generation cost
by a NewCopy or NewRevCopy operation, for any substring of the target string
T. This could be done as a preprocessing part, but to decrease the space
complexity we integrate this part in the core of the algorithm without
increasing the total time complexity. For this aim, first we design a function called
ComputeNewCopyCost(P, S) which fills a table Cost, with the following
definition: Cost[i] is the minimum cost of generating P[1..i] by a NewCopy operation
using a source string S, if P[1..i] is a subsequence of S and ∞ otherwise (see
figure 4).

We denote by optimal supersequence of P[1..i] any substring of S which is
(a) a supersequence of P[1..i] and (b) has the minimum deletion cost among
all these supersequences. If S[ℓ..k] is a supersequence for P[1..i], the cost of
generating P[1..i] from S[ℓ..k] by a NewCopy operation is CopyCost(S[ℓ..k]) +
DelCost(S[ℓ..k]) − DelCost(P[1..i]). The difference between the last two terms
of this expression is the deletion cost of useless (extra) symbols. A necessary
condition for optimality is S[ℓ] = P[1] and S[k] = P[i]. Before giving a proof of
correctness of Algorithm 2, we state the following lemma.

Lemma 2. If S[ℓ..k] is the optimal supersequence for P[1..i] over S[1..N], then
it is the rightmost supersequence for P[1..i] on S[1..k].

Proof: If S[ℓ..k] is the optimal supersequence for P[1..i] over S[1..k], then it has
smaller deletion cost than all S[ℓ'..k] for ℓ' < ℓ, and no S[ℓ''..k] can be a
supersequence for ℓ'' > ℓ. □



Algorithm 2 ComputeNewCopyCost(P, S)

 1. FillArray(Cost, ∞)
 2. FillArray(LastOcc, ∞)
 3. Cost[0] ← 0
 4. for k ← 1 to |S|
 5.     for each i ← |P| downto 2
 6.         if S[k] = P[i] and LastOcc[i − 1] < ∞ then
 7.             LastOcc[i] ← LastOcc[i − 1]
 8.             DifDel ← DelCost(S[LastOcc[i]..k]) − DelCost(P[1..i])
 9.             Cost[i] ← min{Cost[i], CopyCost(S[LastOcc[i]..k]) + DifDel}
10.     if S[k] = P[1] then
11.         LastOcc[1] ← k
12.         Cost[1] ← CopyCost(P[1])

Fig. 4. ComputeNewCopyCost



Proposition 2 ComputeNewCopyCost(P, S) (given in figure 4) determines
correctly in Cost[i] the minimum generation cost of P[1..i] by a NewCopy operation
from a source S, for all i ≤ |P|.

Rather than giving a formal proof for Proposition 2, we will explain how the
pseudocode of figure 4 works. The tables Cost and LastOcc are initialized to
∞ (lines 1-2). The algorithm scans the source from left to right to find the
optimal supersequence for each prefix of P. The algorithm uses the auxiliary
table LastOcc for this aim.

After the k-th letter of S is processed (loop of line 4), the following is true:
LastOcc[i] is the largest ℓ ≤ k such that S[ℓ..k] is a supersequence of P[1..i], or ∞
if no such ℓ exists. The loop on P (line 5) is processed with decreasing indices for
memory optimization. Whenever the letter S[k] occurs in the i-th position in P and
LastOcc[i − 1] < ∞, which means P[1..i − 1] has a supersequence in S[1..k − 1]
(line 6), then there is an opportunity of obtaining a better supersequence for
P[1..i]. LastOcc[i] takes the value of LastOcc[i − 1] (computed for k − 1) since
S[LastOcc[i − 1]..k] is now the rightmost supersequence for P[1..i] (line 7). Its
cost is compared to the best previous one; if better, the new cost is stored
in Cost[i] (lines 8-9). One should observe that the rightmost supersequences
are updated only when a new common letter is scanned. This is necessary and
sufficient as stated in Lemma 2. Note that the processing of i = 1 is done
separately (lines 10-12).
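The walk-through above can be turned into a short sketch. It rests on assumptions that are not fixed by the text: CopyCost is taken to depend only on the segment length, DelCost is a per-symbol function, prefix sums are used to obtain segment deletion costs in constant time, and None stands for the ∞ entries of LastOcc. The name compute_new_copy_cost is hypothetical.

```python
import math
from itertools import accumulate


def compute_new_copy_cost(p, s, copy_cost, del_cost):
    """cost[i] = cheapest way to obtain p[:i] from a segment of s by one copy
    followed by deletions, or math.inf if p[:i] is not a subsequence of s."""
    m = len(p)
    # Prefix sums of deletion costs: s_del[k] = DelCost(S[1..k]).
    s_del = [0] + list(accumulate(del_cost(x) for x in s))
    p_del = [0] + list(accumulate(del_cost(x) for x in p))
    cost = [math.inf] * (m + 1)
    last_occ = [None] * (m + 1)      # None plays the role of "no such position"
    cost[0] = 0
    for k in range(1, len(s) + 1):
        # Decreasing i so that last_occ[i - 1] still refers to position k - 1.
        for i in range(m, 1, -1):
            if s[k - 1] == p[i - 1] and last_occ[i - 1] is not None:
                start = last_occ[i] = last_occ[i - 1]
                extra_del = (s_del[k] - s_del[start - 1]) - p_del[i]
                cost[i] = min(cost[i], copy_cost(k - start + 1) + extra_del)
        if m >= 1 and s[k - 1] == p[0]:
            last_occ[1] = k
            cost[1] = copy_cost(1)
    return cost
```

With unit copy and deletion costs, P = abc and S = axbcab, the prefix ab is obtained for free of deletions from the final ab of S, while abc requires the segment axbc and one deletion.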

Generating the target string T from left to right, the rightmost added
segment is a result of an insertion, NewCopy or NewRevCopy operation. The
following algorithm determines the extended transformation distance of target T
from source S by dynamic programming. C[k] is by definition the
extended transformation distance of target string T[1..k] from source string S. The
different possibilities for generation of the rightmost added segment of T[1..k]
are considered at lines 4-12. ComputeNewRevCopyCost is very similar to
ComputeNewCopyCost. For a NewCopy operation, as P is the reverse of T[1..k],
we need to search in S^{-1} (and not in S) for optimal supersequences (line 7). The
proof of correctness of Algorithm 3 can be done by induction, very similarly to
the proof of Proposition 1.

Algorithm 3 ExtendedTransformationDistance(S, T)

 1. C[0] ← 0
 2. for k ← 1 to |T| do
 3.     C[k] ← ∞
 4.     for i ← 1 to k do
 5.         C[k] ← min{C[k], C[i − 1] + InsertCost(T[i..k])}
 6.     P ← T[1..k]^{-1}
 7.     ComputeNewCopyCost(P, S^{-1})
 8.     for i ← 1 to k do
 9.         C[k] ← min{C[k], C[i − 1] + Cost[k − i + 1]}
10.     ComputeNewRevCopyCost(P, S)
11.     for i ← 1 to k do
12.         C[k] ← min{C[k], C[i − 1] + Cost[k − i + 1]}
13. return C[n]

Fig. 5. Extended Transformation Distance: a dynamic programming solution

The complexity of ComputeNewCopyCost is O(n^2) in time and O(n) in
space. So, the total complexity of determination of the extended
transformation distance is O(n^3) in time and O(n) in space.
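The structure of Algorithm 3 can be sketched in the same spirit. The assumptions are the same as before and are not taken from the paper: cost functions depend on the segment length only, DelCost is per symbol, and ComputeNewRevCopyCost is modelled as the same table computation run on S with the reverse copy cost. All function names are hypothetical.

```python
import math
from itertools import accumulate


def compute_new_copy_cost(p, s, copy_cost, del_cost):
    """cost[i] = cheapest copy-plus-deletions producing p[:i] from a segment of s."""
    m = len(p)
    s_del = [0] + list(accumulate(del_cost(x) for x in s))
    p_del = [0] + list(accumulate(del_cost(x) for x in p))
    cost = [math.inf] * (m + 1)
    last_occ = [None] * (m + 1)
    cost[0] = 0
    for k in range(1, len(s) + 1):
        for i in range(m, 1, -1):
            if s[k - 1] == p[i - 1] and last_occ[i - 1] is not None:
                start = last_occ[i] = last_occ[i - 1]
                extra_del = (s_del[k] - s_del[start - 1]) - p_del[i]
                cost[i] = min(cost[i], copy_cost(k - start + 1) + extra_del)
        if m >= 1 and s[k - 1] == p[0]:
            last_occ[1] = k
            cost[1] = copy_cost(1)
    return cost


def extended_transformation_distance(s, t, insert_cost, copy_cost,
                                     rev_copy_cost, del_cost):
    """Cheapest script for t using insertion, NewCopy and NewRevCopy."""
    n = len(t)
    c = [0] + [math.inf] * n
    for k in range(1, n + 1):
        p = t[:k][::-1]                      # reverse of the current target prefix
        # NewCopy: search p in the reverse of s; NewRevCopy: search p in s.
        new_copy = compute_new_copy_cost(p, s[::-1], copy_cost, del_cost)
        new_rev = compute_new_copy_cost(p, s, rev_copy_cost, del_cost)
        for i in range(1, k + 1):
            length = k - i + 1               # last operation produces t[i..k]
            best_last = min(insert_cost(length), new_copy[length], new_rev[length])
            c[k] = min(c[k], c[i - 1] + best_last)
    return c[n]
```

In the small case s = "abxc" and t = "abc", with an insertion cost of 5 per symbol, unit copy cost for segments of length at least 2 and unit deletion cost, the cheapest script copies abxc and deletes x.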

5 Biological Justification 

In this section, we show using biological sequences that the extended
transformation distance can be a more realistic distance than the transformation distance
on real biological sequences. To this end, the data we use consists of partial
DNA sequences which participate in the coding of 16S RNA. These sequences are
known to be good phylogenetic markers, because they evolve very slowly in
general. Here we consider only three sequences of the data, which correspond to three
species: Trichoniscus pusillus, Haplophthalmus mengei and Asellus aquaticus. The
first two are from the same family (Trichoniscoidea) while the third is from the
Aselidea family. The first family is a terrestrial family and the last family is an
aquatic family. Although one expects that the sequences of T. pusillus should be
more similar to H. mengei than to A. aquaticus, the transformation distance is
unable to capture this relative similarity. In other words, the transformation
distance from A. aquaticus into T. pusillus is smaller than the transformation
distance from H. mengei into T. pusillus for the different choices of parameters for
MFL and cost functions (this is confirmed in [10]). The extended
transformation distance solves this problem. The following table shows the corresponding
transformation and extended transformation distances. The MFL is 9 in this
example. This shows us that deletions make the model more robust on real
data.

Table 1. Transformation and extended transformation distance from A. aquaticus and
H. mengei into T. pusillus

                Transformation Distance    Extended Transformation Distance
H. mengei       1322                       516
A. aquaticus    1189                       522


6 Remarks and Conclusion

In this paper, we presented a new improved algorithm for the computation of the
transformation distance. This question is central in the study of genome
evolution. We largely improve the running time complexity (from O(n^4) to O(n^2)),
thus allowing much longer sequences to be treated (typically 10000 symbols instead of
100) in the same time, while using only linear space. We also gave an algorithm
for the transformation distance problem in the presence of deletion operations,
which gives the model its full generality. In this version, costs have been given
a special additive form for clarity. In fact a number of variations are possible
within our framework: the main property needed on costs seems to be their
subadditivity, in which case our algorithms are correct.

If the DelCost function is a constant function over different symbols, which
means that the deletions of any two symbols have the same cost, the optimal
supersequence problem becomes the shortest supersequence problem. This
problem, called Episode Matching, is studied in several papers [9, 4]. For our particular
purpose, the complexity we obtained by Algorithm 2 is better than the one that
could be achieved by algorithms using the best known episode matching algorithms.

In this paper, we state that the complexity of Algorithm 2 is O(n^2); this is
the worst case complexity; in fact only a small proportion of pairs (S[k], T[j])
imply running the inner loop. Under certain additional statistical hypotheses
the average complexity could be less than O(n^2): in particular if the alphabet
size is of the order of the string lengths, the average cost falls down to O(n) for
Algorithm 2 and thus O(n^2) for Algorithm 3.

Different implementations of our algorithms can be considered. In Algorithm 
2, if for each symbol in string P we store the last occurrence of this symbol in the 
string (for example by adding a pre-processing part), the loop of lines 5-6 can 
pass only on these symbols, which yields a better experimental complexity. In 
Algorithm 1, one can use (generalized) suffix trees for substring
searching, but the theoretical complexity is not improved.
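The pre-processing step mentioned for Algorithm 2 could look like the following
minimal Python sketch. Algorithm 2 itself is not reproduced here, so the
function name last_occurrences and the way the table would be consumed by the
inner loop are assumptions made only for illustration.

```python
from typing import Dict


def last_occurrences(p: str) -> Dict[str, int]:
    """Map every symbol of p to the index of its last occurrence in p."""
    last: Dict[str, int] = {}
    for index, symbol in enumerate(p):
        last[symbol] = index  # a later position overwrites an earlier one
    return last


# Hypothetical use: a loop that previously scanned every position of P
# would now visit one position per distinct symbol.
table = last_occurrences("ACGTAC")
for symbol, position in table.items():
    pass  # the body of the inner loop would go here
```

The table has one entry per distinct symbol, so it is built in linear time and
uses space bounded by the alphabet size.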

In some variants of the transformation distance problem the offsets (indices)
of copied segments in one or both of the source and target strings participate in
the computation of the operation cost. Our algorithm can be adapted easily to
solve these variants as well, because the substring (and subsequence) existence
tests are performed in the core of the algorithm (and not in the preprocessing),
so one can search for the indices minimizing the cost function. In some cases,
for general cost functions, an additional O(n) time is necessary, but the space
complexity remains linear. We will not go into the details here.

Several directions can be considered for future work on this problem.
Different sets of evolutionary operations, different cost functions, and limits
on the number of times that a source segment can be copied are some of the
interesting ones.

Abstract. In document filtering and content-based routing the aim is to
transmit to the user only those documents that match the user’s interests 
or profile. As filtering systems are deployed on the Internet, the number 
of users can become large. In this paper we focus on the question of how 
a large set of user profiles can be quickly searched in order to find those 
that are relevant to the document. In the abstract setting we assume 
that each profile is given as a regular expression, and, given a set of 
regular languages (the set of profiles), we want to determine for a given 
input string (the document) all those languages the input string belongs 
to. We analyze this problem, called the classification problem for a set 
of regular languages, and we show that in various important cases the 
problem can be solved by a small single deterministic finite automaton 
extended by conditional transitions. 



1 Introduction 

In document filtering and content-based routing (e.g. [3-5,7,9-11,13,15]) the 
aim is to transmit to the user only those documents that match the user’s in- 
terests or profile. For XML documents the profiles are defined by the XPath 
language based on a restricted form of regular expressions. (XPath also contains 
irregular parts that require other analysis methods than those for regular lan- 
guages.) XML routers in a network forward XML packets continuously from data 
producers to consumers. Each packet obtained by a router will be forwarded to a 
subset of its neighboring nodes in the network, and the forwarding decisions will 
be made according to the subscriptions of the clients given by XPath expressions. 
The number of the clients’ subscriptions, and thus the set of XPath expressions 
to be evaluated can be large; therefore it is important that the evaluation is 
efficient and scalable. 

Apart from XML routing, regular expressions are important in other routing 
environments (see [5]). For example, in the BGP4 Internet routing protocol [19] 
routers transmit to neighboring routers advertisements of how they could trans- 
mit packets to various IP addresses. The router that receives advertisements is 
allowed to define regular expressions with priorities on routing system sequences. 
The priority of an advertisement is obtained by matching it with the given set
of regular expressions.

Finite automata are a natural and efficient way to represent and process
XPath expressions (see e.g. [3, 10]). Deterministic finite automata (DFAs) are of
course more efficient than the nondeterministic ones (NFAs), because in DFAs
there is only one possible next state. In [10] it is reported that for a large number
of XPath expressions (up to 1,000,000 in the tests of [10]), processing using DFAs
was many orders of magnitude faster than using NFAs.

The difficulty in using DFAs is their size, which can be exponential in the size
of the NFA or the regular expression. In other words, NFAs and regular
expressions are exponentially more succinct representations of regular languages
than DFAs. Formally, there exists an infinite sequence of regular languages
$L_1, L_2, \ldots$, such that each $L_n$ is described by a regular expression (or
nondeterministic automaton) of size $O(n)$, but any DFA that accepts $L_n$ must
have size exponential in $n$ [14, 18]. It should be noted that NFAs seem to be the
most succinct representation, because NFAs are also exponentially more succinct
than regular expressions [8], but any regular expression can easily be
transformed into an NFA in linear time (e.g. [12]).

In the context of applying regular expressions to routing problems, the question
is not only to check whether or not an input string belongs to a single language,
but to report for a (possibly large) set of languages all those languages
the given string belongs to. This problem, called the classification problem for
a set of languages, is reminiscent of the lexical analysis of programming
languages, where a single DFA is constructed that extracts the lexical items from
the program text (see e.g. [2, 17]). Solving the classification problem of $n$
regular languages by constructing a single DFA can result in a number of states
that is exponential in $n$, even though the total number of states in the DFAs
that correspond to the $n$ languages is $O(n)$ [10].

The problem of the exponential size of a single DFA constructed for a set 
of XPath expressions is addressed in [10] by using a “lazy” DFA; that is, the 
complete “eager” DFA is not constructed before processing a string, but the 
usual subset construction for determinizing an NFA is applied when needed. In
[10] an NFA from each XPath expression is constructed, and at run time, the 
processing of the NFAs is simulated by constructing those parts of the DFA that 
are reachable by the given input string. In [10] it is also demonstrated that this 
lazy evaluation can be efficient in the sense that the number of generated states 
remains small. The lazy DFA approach is further optimized in [6]. However, in 
[16] it is demonstrated that even a lazy DFA can become large when processing 
complex XML documents. 
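The lazy evaluation described above can be illustrated by a minimal Python
sketch. It is not the implementation of [10]: the NFA encoding, the wildcard
label ANY, and the class name LazyDFA are assumptions chosen for brevity. The
point is only that a DFA transition is computed by the subset construction the
first time it is needed and is then cached.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

ANY = None  # hypothetical wildcard label: the transition fires on every symbol

Delta = Dict[Tuple[int, Optional[str]], Set[int]]


class LazyDFA:
    """Subset construction that creates DFA transitions only when they are used."""

    def __init__(self, delta: Delta, start: int, accepting: Set[int]):
        self.delta = delta
        self.start: FrozenSet[int] = frozenset([start])
        self.accepting = accepting
        # (DFA state, symbol) -> DFA state, filled on demand
        self.cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def step(self, state: FrozenSet[int], symbol: str) -> FrozenSet[int]:
        key = (state, symbol)
        if key not in self.cache:
            targets: Set[int] = set()
            for q in state:
                targets |= self.delta.get((q, symbol), set())
                targets |= self.delta.get((q, ANY), set())
            self.cache[key] = frozenset(targets)
        return self.cache[key]

    def accepts(self, word: Iterable[str]) -> bool:
        state = self.start
        for symbol in word:
            state = self.step(state, symbol)
        return bool(state & self.accepting)


# NFA for Sigma* a Sigma* b: state 0 loops, 'a' leads to 1, state 1 loops,
# 'b' leads to the accepting state 2, which has no outgoing transitions.
nfa: Delta = {(0, ANY): {0}, (0, "a"): {1}, (1, ANY): {1}, (1, "b"): {2}}
dfa = LazyDFA(nfa, start=0, accepting={2})
```

Only the transitions actually exercised by the input end up in the cache, which
is why the number of generated states can stay small for typical inputs while
still being exponential in the worst case.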

Even though the lazy evaluation is often efficient, it is certainly of interest 
to try to avoid the construction of DFA states altogether during the analysis of 
the input string. In this paper we consider the question of how the classification 
problem for a set of regular languages can be solved by an eager DFA. We show 
that in many interesting cases related to XPath expressions, a solution based on 
a complete DFA can be made efficient, although the direct construction of the 
DFA leads to an exponential number of states. In our solution the underlying
DFA is generalized by allowing conditional transitions.

2 Problem Statement



Given a set of $n$ regular languages, we consider the classification problem, when
for $i = 1, \ldots, n$ language $i$ is defined by a regular expression $E_i$ of the form

    $E_i = Y_{i_0} \Sigma^* Y_{i_1} \Sigma^* Y_{i_2} \cdots \Sigma^* Y_{i_{m_i}} \Sigma^* Y_{i_{m_i+1}}$,

where $\Sigma$ denotes the whole alphabet and each $Y_{i_j}$ is a set of strings formed by
concatenation from individual letters in $\Sigma$. For $j = 1, \ldots, m_i$, we require that
$Y_{i_j}$ does not contain the empty string $\epsilon$. For $Y_{i_0}$ and $Y_{i_{m_i+1}}$ we require
that each of them is either $\{\epsilon\}$ or does not contain $\epsilon$.

For solving the classification problem, given an input string $w$ in $\Sigma^*$, we have
to determine all languages $L(E_1), \ldots, L(E_n)$ which $w$ belongs to. Our goal is to
automatically generate in time $O(|E_1| + \cdots + |E_n|)$ an algorithm that has time
complexity $O(|w|)$, where $|E_i|$ denotes the size of expression $E_i$ and $|w|$ denotes
the length of $w$. In order to achieve this we need to place some restrictions on
the sets $Y_{i_j}$, as will be defined in further sections.

We will first consider our classification problem in the case in which for
$i = 1, \ldots, n$ language $i$ is defined by a regular expression $E_i$ of the form

    $E_i = x_{i_0} \Sigma^* a_{i_1} \Sigma^* a_{i_2} \cdots \Sigma^* a_{i_{m_i}} \Sigma^* x_{i_{m_i+1}}$,

where all $a_{i_j}$ are pairwise different symbols in $\Sigma$ and each of $x_{i_0}$ and
$x_{i_{m_i+1}}$ is either the empty string $\epsilon$ or a single symbol in $\Sigma$. Neither
$x_{i_0}$ nor $x_{i_{m_i+1}}$ is allowed to be equal to any $a_{j_k}$.

The above simple form of language description may cause an exponential
size in a DFA that solves the classification problem. As an example, consider the
following three XPath expressions [10]:

$X1 IN $R//book//figure
$X2 IN $R//chapter//figure
$X3 IN $R//table//figure

If an XML stream is processed against this set of XPath expressions using a
single DFA (as defined e.g. in [10]), then this DFA recognizes the three regular
languages defined by the following regular expressions (book, chapter, table,
and figure are denoted by $a_1$, $a_2$, $a_3$, and $b$, respectively):

    $\Sigma^* a_1 \Sigma^* b$, $\Sigma^* a_2 \Sigma^* b$, and $\Sigma^* a_3 \Sigma^* b$,

where $\Sigma$ denotes an alphabet containing, among other symbols, $a_1$, $a_2$, $a_3$, and
$b$. The states and part of the transitions of this DFA, denoted $D_3$, are given in
Figure 1. In $D_3$, self-loops occur in all states except the final ones on all
symbols other than those shown. The loops and backward transitions from the final
states are not shown. The DFA $D_3$ is obtained by the usual subset construction from
the nondeterministic automata corresponding to the regular expressions, and it
cannot be further minimized, because the final states all accept different subsets
of the three languages. There is a separate final state for all distinct subsets




324 Eljas Soisalon-Soininen and Tatu Ylonen 




Fig. 1. Part of the transitions of the DFA that recognizes the languages
$L(\Sigma^* a_1 \Sigma^* b)$, $L(\Sigma^* a_2 \Sigma^* b)$, and $L(\Sigma^* a_3 \Sigma^* b)$. Transitions without an
attached symbol apply to all symbols other than those that have a marked
transition. The transitions from the final states are not shown.



of $\{a_1, a_2, a_3\}$. That is, in a final state the DFA must remember which of the
symbols $a_1$, $a_2$, and $a_3$ it has seen. In general, the minimal DFA that classifies
the $n$ languages $L(\Sigma^* a_1 \Sigma^* b), L(\Sigma^* a_2 \Sigma^* b), \ldots, L(\Sigma^* a_n \Sigma^* b)$ has $O(2^n)$
states.

Based on this observation, it is clear that the classification of languages
$L(E_1), \ldots, L(E_n)$, where each $E_i$ is a regular expression given as above, using
a usual DFA is infeasible, if $n$ is large. Notice that the classification by final
states means that no two final states that accept strings belonging to different
subsets of $L(E_1), \ldots, L(E_n)$ can be combined as equivalent states. The
exponential lower bound also holds, of course, if for all $E_i$ the symbol after the
last $\Sigma^*$ is missing and $m_i > 1$.
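The growth in the number of states can be checked experimentally with a small
Python sketch that runs the subset construction on the union of the NFAs for
$\Sigma^* a_i \Sigma^* b$ and counts the reachable DFA states. The NFA encoding (pairs of
language index and phase) and the function name eager_dfa_states are
assumptions of this sketch, not part of the construction in [10].

```python
from collections import deque
from typing import FrozenSet, Set, Tuple

NfaState = Tuple[int, int]  # (language index i, phase 0/1/2)


def nfa_step(state: NfaState, symbol: str) -> Set[NfaState]:
    """Transitions of the NFA for Sigma* a_i Sigma* b."""
    i, phase = state
    targets: Set[NfaState] = set()
    if phase == 0:                 # still waiting for a_i
        targets.add((i, 0))
        if symbol == f"a{i}":
            targets.add((i, 1))
    elif phase == 1:               # a_i seen, waiting for a final b
        targets.add((i, 1))
        if symbol == "b":
            targets.add((i, 2))
    return targets                 # phase 2 (accepting) has no successors


def eager_dfa_states(n: int) -> int:
    """Number of DFA states reachable by the subset construction."""
    symbols = [f"a{i}" for i in range(1, n + 1)] + ["b", "other"]
    start: FrozenSet[NfaState] = frozenset((i, 0) for i in range(1, n + 1))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in symbols:
            target = frozenset(t for s in current for t in nfa_step(s, symbol))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)
```

Under this encoding each reachable state records which of the symbols
$a_1, \ldots, a_n$ have been read and whether the last symbol was $b$, so the count
roughly doubles with every additional expression.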

It should be noted that the languages $L(\Sigma^* a_1 \Sigma^* b), \ldots, L(\Sigma^* a_n \Sigma^* b)$ can be
classified by a small deterministic automaton, if we relax the requirement that
classification is performed by final states only. We may fuse all states with input
symbol $a_i$ as a single state and all final states as a single state, and classify a
string to belong to language $L(\Sigma^* a_i \Sigma^* b)$, if it has arrived at a final state and
passed the unique state with input symbol $a_i$. In Figure 2, this DFA is shown
for $n = 3$.

3 DFAs with Conditional Transitions 

In the previous section we were able to find a small deterministic automaton that
classifies strings to languages $L(\Sigma^* a_i \Sigma^* b)$. This was done simply by allowing the
DFA to check whether or not a certain state has been visited after processing
the input string. Such a test can be done efficiently by using, for example, a bit
vector indexed by state number.

Fig. 2. DFA that can be used to classify the languages $L(\Sigma^* a_1 \Sigma^* b)$,
$L(\Sigma^* a_2 \Sigma^* b)$, and $L(\Sigma^* a_3 \Sigma^* b)$. String $w$ is in $L(\Sigma^* a_i \Sigma^* b)$ if it has passed a
shaded final state with incoming edges labelled by $a_i$ and the computation ends
at the non-shaded final state. Edges with no attached label denote transitions
on all symbols other than those explicitly given.
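The visited-state test can be sketched in Python as follows. The exact
transitions of the automaton in the figure are not recoverable from the text,
so the state numbering (0 for the initial state, i for the state entered on
$a_i$, and two extra states) is an assumption of this sketch; only the idea of
a bit vector indexed by state number is taken from the text.

```python
from typing import List


def classify(word: List[str], n: int) -> List[int]:
    """Return all i such that word is in L(Sigma* a_i Sigma* b)."""
    symbol_state = {f"a{i}": i for i in range(1, n + 1)}  # state i is entered on a_i
    final, rest = n + 1, n + 2       # hypothetical numbering of the extra states
    visited = [False] * (n + 3)      # bit vector indexed by state number
    state = 0                        # initial state
    for symbol in word:
        if symbol in symbol_state:
            state = symbol_state[symbol]
        elif symbol == "b":
            # 'b' leads to the single final state once some a_i has been read
            state = final if state != 0 else 0
        elif state == final:
            state = rest             # leave the final state on any other symbol
        visited[state] = True
    if state != final:
        return []
    return [i for i in range(1, n + 1) if visited[i]]


print(classify(["a1", "x", "a3", "b"], 3))
```

The automaton has n + 3 states instead of exponentially many, and the
classification is read off from the bit vector once the input has been
consumed.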

In the more general case when the expressions are of the form of our problem 
statement it is not enough to introduce such a classifying strategy. 

Our solution is to introduce conditional transitions into DFAs such that the 
required conditions can be tested efficiently. A DFA with conditional transitions, 
denoted cDFA, has a set of states and a set of transitions like a usual DFA, but a
transition can be conditional such that it is allowed to be performed only when 
a certain condition is met. This condition is usually some simple property of the 
underlying cDFA. 

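Let $E_1, \ldots, E_n$ be regular expressions such that each $E_i$ is of the form

    $E_i = x_{i_0} \Sigma^* a_{i_1} \Sigma^* a_{i_2} \cdots \Sigma^* a_{i_{m_i}} \Sigma^* x_{i_{m_i+1}}$,    (1)

where $\Sigma$ is the set of all those symbols that appear in some $E_i$, all symbols
$a_{i_j}$ are pairwise different, and $x_{i_0}$ and $x_{i_{m_i+1}}$ both are either $\epsilon$ or a
single symbol different from any $a_{i_j}$.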

We consider the classification problem for $L(E_1), \ldots, L(E_n)$. In other words,
given an input string $w$ in $\Sigma^*$, we have to determine all languages
$L(E_1), \ldots, L(E_n)$ which $w$ belongs to. We construct a DFA with conditional
transitions, denoted $M_c$, as follows. For simplicity, we assume here that each
$E_i$ is of the form $\Sigma^* a_{i_1} \Sigma^* a_{i_2} \cdots \Sigma^* a_{i_{m_i}} \Sigma^*$.

(i) There is a unique initial state, denoted $q_0$, in $M_c$.

(ii) The set $Q$ of states of $M_c$ is

    $\{q_0, q_{1_1}, \ldots, q_{1_{m_1}}, q_{2_1}, \ldots, q_{2_{m_2}}, \ldots, q_{n_1}, \ldots, q_{n_{m_n}}\}$.

(iii) Case 1. Let $q$ be any state in $Q$ that is not the initial state $q_0$. There
is a transition $q a_{i_j} \to q_{i_j}$ for all combinations of $i$ and $j$, but these
transitions are conditional in the following way. For $j > 1$, transition
$q a_{i_j} \to q_{i_j}$ is to be performed, if in processing the current input string
state $q_{i_{j-1}}$ has already been visited but state $q_{i_j}$ has not been visited yet.
For $j = 1$, transition $q a_{i_1} \to q_{i_1}$ is to be performed, if in processing the
current input string state $q_{i_1}$ has not been visited yet. Case 2. If at state $q$
no transition as defined in Case 1 applies for the next input symbol $a$, then
transition $q a \to q$ is applied.

(iv) For the initial state $q_0$ there are transitions $q_0 a_{i_1} \to q_{i_1}$ for all
$i = 1, \ldots, n$, and transitions $q_0 a \to q_0$ for all other input symbols $a$ in $\Sigma$.

(v) $M_c$ contains classification states, which are used in the following way.
Assume that input string $w$ has been fed to $M_c$ and the process has ended after
consuming the whole string. If $q$ is a classifying state and it has been visited
during the processing of $w$, then $M_c$ classifies $w$ into the language containing all
strings in $\Sigma^*$ that pass $q$ when feeding them to $M_c$. The states $q_{i_{m_i}}$,
for $i = 1, \ldots, n$, are chosen as classifying states in $M_c$.

Observe that by (iii) and (iv) $M_c$ is deterministic, that is, there is always
exactly one next state, because $a_{i_j} \neq a_{l_k}$ always when $i \neq l$ or $j \neq k$. Notice
that the size of $M_c$ would be $|Q|^2$, if the conditional transitions were explicitly
stored. But it is not necessary to store the transitions, because they can be
directly concluded from the current state and input symbol.

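Example. Consider the regular expressions $E_1 = \Sigma^* a_{1_1} \Sigma^* a_{1_2} \Sigma^*$ and
$E_2 = \Sigma^* a_{2_1} \Sigma^* a_{2_2} \Sigma^*$, and the classification problem for the languages
$L(E_1)$ and $L(E_2)$. The corresponding cDFA has the set of states
$\{q_0, q_{1_1}, q_{1_2}, q_{2_1}, q_{2_2}\}$. The transitions from the initial state $q_0$ are
$q_0 a_{1_1} \to q_{1_1}$, $q_0 a_{2_1} \to q_{2_1}$, $q_0 a_{1_2} \to q_0$, and $q_0 a_{2_2} \to q_0$.
The conditional transitions are

$q_{1_1} a_{1_1} \to q_{1_1}$, $q_{1_1} a_{2_1} \to q_{2_1}$, $q_{1_1} a_{2_1} \to q_{1_1}$, $q_{1_1} a_{1_2} \to q_{1_2}$,
$q_{1_1} a_{1_2} \to q_{1_1}$, $q_{1_1} a_{2_2} \to q_{2_2}$, $q_{1_1} a_{2_2} \to q_{1_1}$,
$q_{2_1} a_{1_1} \to q_{1_1}$, $q_{2_1} a_{1_1} \to q_{2_1}$, $q_{2_1} a_{1_2} \to q_{1_2}$, $q_{2_1} a_{1_2} \to q_{2_1}$,
$q_{2_1} a_{2_1} \to q_{2_1}$, $q_{2_1} a_{2_2} \to q_{2_2}$, $q_{2_1} a_{2_2} \to q_{2_1}$,
$q_{1_2} a_{1_1} \to q_{1_1}$, $q_{1_2} a_{1_1} \to q_{1_2}$, $q_{1_2} a_{1_2} \to q_{1_2}$, $q_{1_2} a_{2_1} \to q_{2_1}$,
$q_{1_2} a_{2_1} \to q_{1_2}$, $q_{1_2} a_{2_2} \to q_{2_2}$, $q_{1_2} a_{2_2} \to q_{1_2}$,
$q_{2_2} a_{1_1} \to q_{1_1}$, $q_{2_2} a_{1_1} \to q_{2_2}$, $q_{2_2} a_{1_2} \to q_{1_2}$, $q_{2_2} a_{1_2} \to q_{2_2}$,
$q_{2_2} a_{2_1} \to q_{2_1}$, $q_{2_2} a_{2_1} \to q_{2_2}$, $q_{2_2} a_{2_2} \to q_{2_2}$.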

Recall that the conditional transitions need not be stored, because the pos- 
sible transitions are always implied by the state and input symbol. The unique 
applicable transition is implied by the passed states, as explained in rule (iii) of
the construction of $M_c$.

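An example computation:

$q_0\, a_{1_1} a_{1_1} a_{2_2} a_{2_2} a_{1_2} a_{2_1} \Rightarrow q_{1_1}\, a_{1_1} a_{2_2} a_{2_2} a_{1_2} a_{2_1}
\Rightarrow q_{1_1}\, a_{2_2} a_{2_2} a_{1_2} a_{2_1} \Rightarrow q_{1_1}\, a_{2_2} a_{1_2} a_{2_1} \Rightarrow q_{1_1}\, a_{1_2} a_{2_1}
\Rightarrow q_{1_2}\, a_{2_1} \Rightarrow q_{2_1}$.

In this computation the state $q_{1_2}$ is visited, and thus
$a_{1_1} a_{1_1} a_{2_2} a_{2_2} a_{1_2} a_{2_1}$ is classified to belong to the language
$L(\Sigma^* a_{1_1} \Sigma^* a_{1_2} \Sigma^*)$.

Let $w$ be a string in language $L(E_i) = L(\Sigma^* a_{i_1} \cdots \Sigma^* a_{i_{m_i}} \Sigma^*)$. Then $w$ is
of the form $w_1 a_{i_1} w_2 a_{i_2} \cdots w_{m_i} a_{i_{m_i}} w_{m_i+1}$, where each $w_j$ is in $\Sigma^*$.
It is seen by a simple induction that this string is classified by $M_c$ to the
language containing those strings that pass the classification state
$q_{i_{m_i}}$. On the other hand, any string classified by $q_{i_{m_i}}$ must be of the
form $w_1 a_{i_1} w_2 a_{i_2} \cdots w_{m_i} a_{i_{m_i}} w_{m_i+1}$, where each $w_j$ is in $\Sigma^*$. Thus,
the languages classified by the states $q_{i_{m_i}}$ are exactly the languages
$L(E_i) = L(\Sigma^* a_{i_1} \cdots \Sigma^* a_{i_{m_i}} \Sigma^*)$, for $i = 1, \ldots, n$.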
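The construction and the example computation above can be replayed with a
short Python sketch. It covers only the simplified form without leading and
last symbols; the class name CDFA, the string names of the symbols, and the
use of a Python set for the visited states (in place of a bit vector) are
assumptions of the sketch. As in the text, no transition table is stored: the
next state follows from the input symbol and the set of visited states.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[int, int]  # (i, j) stands for the state q_{i_j}


class CDFA:
    """Conditional DFA for expressions Sigma* a_{i_1} Sigma* ... a_{i_m} Sigma*."""

    def __init__(self, patterns: List[List[str]]):
        self.position: Dict[str, State] = {}
        self.classifying: List[State] = []
        for i, symbols in enumerate(patterns, start=1):
            for j, symbol in enumerate(symbols, start=1):
                if symbol in self.position:
                    raise ValueError("all symbols must be pairwise different")
                self.position[symbol] = (i, j)
            self.classifying.append((i, len(symbols)))

    def run(self, word: List[str]):
        state = "q0"
        visited: Set[State] = set()
        trace = []
        for symbol in word:
            target = self.position.get(symbol)
            if target is not None:
                i, j = target
                # rule (iii): q_{i_(j-1)} already visited (or j = 1) and
                # q_{i_j} not visited yet; rule (iv) is the case j = 1 at q0
                predecessor_ok = j == 1 or (i, j - 1) in visited
                if predecessor_ok and target not in visited:
                    state = target
                    visited.add(target)
            trace.append(state)  # Case 2: otherwise the state is unchanged
        matched = [i for (i, m) in self.classifying if (i, m) in visited]
        return matched, trace


cdfa = CDFA([["a11", "a12"], ["a21", "a22"]])
matched, trace = cdfa.run(["a11", "a11", "a22", "a22", "a12", "a21"])
print(matched, trace)
```

Each input symbol is handled with a constant number of dictionary and set
operations, which matches the linear classification time claimed for the
cDFA.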

If expression $E_i$ has a leading symbol $x_{i_0}$, then the above construction must
be changed such that there is a state $q_{i_0}$ with input symbol $x_{i_0}$ between the
initial state and state $q_{i_1}$. If $E_i$ has a last symbol $x_{i_{m_i+1}}$, then a final
state $q_{i_{m_i+1}}$ must be introduced to $M_c$. Classification must also be changed
to: string $w$ is classified to language $L(E_i)$ if and only if string $w$ passes
classifying state $q_{i_{m_i}}$ and ends at the final state $q_{i_{m_i+1}}$.

We have:

Theorem 1. Let $M_c$ be a cDFA constructed from regular expressions $E_1, \ldots, E_n$
of the form (1). Then $M_c$ is constructed in time $O(|E_1| + \cdots + |E_n|)$, and $M_c$
classifies all strings in $\Sigma^*$ into languages with respect to
$L(E_1), \ldots, L(E_n)$. The time complexity of classifying string $w$ into all
languages it belongs to is $O(|w|)$.



4 Classification of Strings for an Ordered Set of Patterns 



In the previous section we considered the classification problem for a rather 
restricted class of regular languages. In this section we extend the result to the 
case in which the expressions can have a much more general form. 

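Let $E_1, \ldots, E_n$ be regular expressions defined on alphabet $\Sigma$ such that each
$E_i$ is of the form:

    $E_i = Y_{i_0} \Sigma^* Y_{i_1} \Sigma^* Y_{i_2} \cdots \Sigma^* Y_{i_{m_i}} \Sigma^* Y_{i_{m_i+1}}$,    (2)

where each $Y_{i_j}$ is a non-empty set of strings in $\Sigma^*$. For $j = 1, \ldots, m_i$, we
require that $Y_{i_j}$ does not contain the empty string $\epsilon$. For $Y_{i_0}$ and
$Y_{i_{m_i+1}}$ we require that they are either $\{\epsilon\}$ or do not contain $\epsilon$.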

In solving this classification problem efficiently, we apply the construction of
the previous section, and the construction of Aho and Corasick [1] for recognizing
regular languages defined by regular expressions of the form

    $\Sigma^* Y \Sigma^*$,

where $Y$ is a set of strings not containing the empty string. The method of [1]
constructs a DFA in linear time from the expression. Given a set of $n$ regular
expressions $E_1, \ldots, E_n$ such that each $E_i$ is of the form (2), we apply the
construction of [1] to the expression

    $\Sigma^* (Y'_{1_0} \#_{1_0} \cup Y_{1_1} \#_{1_1} \cup \ldots \cup Y_{1_{m_1}} \#_{1_{m_1}} \cup Y'_{1_{m_1+1}} \#_{1_{m_1+1}} \cup \cdots
      \cup Y'_{n_0} \#_{n_0} \cup Y_{n_1} \#_{n_1} \cup \ldots \cup Y_{n_{m_n}} \#_{n_{m_n}} \cup Y'_{n_{m_n+1}} \#_{n_{m_n+1}}) \Sigma^*$,    (3)

where all $\#_{i_j}$ are new symbols and $Y'_{i_0}$ (resp. $Y'_{i_{m_i+1}}$) is the empty set,
if $Y_{i_0} = \{\epsilon\}$ (resp. $Y_{i_{m_i+1}} = \{\epsilon\}$), and otherwise $Y_{i_0}$ (resp.
$Y_{i_{m_i+1}}$).

That is, we construct a deterministic automaton, denoted $M_a$, that recognizes
the language defined by this expression. This automaton is used as follows: An
input string in $\Sigma^*$ not containing any symbol $\#_{i_j}$ is fed to the automaton, and
whenever a state is reached from which there are transitions by symbols
$\#_{i_j}$, all these symbols will be output (in any order). The output sequence
obtained for an input string $w$ in $\Sigma^*$ is then used as an input string for a cDFA
constructed from the expressions $E'_1, \ldots, E'_n$, where $E'_i$ is

    $E'_i = y_{i_0} \Sigma^* \#_{i_1} \Sigma^* \#_{i_2} \cdots \Sigma^* \#_{i_{m_i}} \Sigma^* y_{i_{m_i+1}}$.

Here $y_{i_0}$ (resp. $y_{i_{m_i+1}}$) is $\epsilon$, if $Y'_{i_0}$ (resp. $Y'_{i_{m_i+1}}$) is the empty
set, and otherwise $\#_{i_0}$ (resp. $\#_{i_{m_i+1}}$).

This cDFA performs the final classification. The construction works correctly,
if and only if no non-empty suffix of a string in $Y_{i_j}$ is a non-empty prefix of
a string in $Y_{i_{j+1}}$, for $j = 0, \ldots, m_i$.

We have: 

Theorem 2. Let $E_1, \ldots, E_n$ be regular expressions of the form (2) such that
for all $E_i$ no non-empty suffix of a string in $Y_{i_j}$ is a non-empty prefix of a
string in $Y_{i_{j+1}}$. Moreover, assume that for no string $w$ in
$L(E_1) \cup \ldots \cup L(E_n)$ does the deterministic automaton $M_a$ constructed from
the expression (3) output more than $c|w|$ symbols, where $c$ is a constant. Then
it is possible to construct in time $O(|E_1| + \cdots + |E_n|)$ a deterministic program
that solves the classification problem in linear time. That is, for all $w$ in $\Sigma^*$
this program classifies $w$ in time $O(|w|)$ with respect to the languages
$L(E_1), \ldots, L(E_n)$.
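The two-stage scheme can be sketched in Python as follows. This is a
simplified illustration, not the program of the theorem: it assumes that every
$Y_{i_0}$ and $Y_{i_{m_i+1}}$ equals $\{\epsilon\}$, that keyword occurrences of consecutive
sets never overlap in the input, and it represents the new symbol $\#_{i_j}$ by the
pair (i, j). The names build_matcher, emitted_markers, and Classifier are
hypothetical. The first stage is a standard Aho-Corasick automaton; the second
stage applies the conditional-transition rule of the cDFA to the emitted
markers.

```python
from collections import deque
from typing import Dict, Iterator, List, Set, Tuple

Marker = Tuple[int, int]  # (i, j) plays the role of the new symbol #_{i_j}


def build_matcher(keywords: Dict[str, List[Marker]]):
    """Aho-Corasick automaton: goto edges, failure links, outputs per node."""
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[List[Marker]] = [[]]
    for word, markers in keywords.items():
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].extend(markers)
    queue = deque(goto[0].values())  # depth-1 nodes keep failure link 0
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def emitted_markers(text: str, matcher) -> Iterator[Marker]:
    """Feed text to the automaton and output the markers of matched keywords."""
    goto, fail, out = matcher
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        yield from out[node]


class Classifier:
    def __init__(self, expressions: List[List[Set[str]]]):
        keywords: Dict[str, List[Marker]] = {}
        for i, sets in enumerate(expressions, start=1):
            for j, keyword_set in enumerate(sets, start=1):
                for word in keyword_set:
                    keywords.setdefault(word, []).append((i, j))
        self.matcher = build_matcher(keywords)
        self.last = [(i, len(sets)) for i, sets in enumerate(expressions, start=1)]

    def classify(self, text: str) -> List[int]:
        visited: Set[Marker] = set()
        for i, j in emitted_markers(text, self.matcher):
            # cDFA rule: accept #_{i_j} only after #_{i_(j-1)} has been seen
            if j == 1 or (i, j - 1) in visited:
                visited.add((i, j))
        return [i for i, m in self.last if (i, m) in visited]


classifier = Classifier([[{"book"}, {"figure", "fig"}], [{"chapter"}, {"table"}]])
print(classifier.classify("a book with a chapter and a figure"))
```

The automaton is built once from all keyword sets, and each document is then
classified in a single left-to-right pass whose cost is proportional to the
length of the text plus the number of emitted markers.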

5 Conclusions 

Content-based classification is based on the information in the document itself 
and not on the information in the headers of the packets to be routed. Users’ 
interests and subscriptions are typically given by regular expressions based on 
structure defining elements in the documents. The number of such expressions 
can become very large, and the classification problem cannot be solved by sim- 
ply constructing by standard methods a single deterministic automaton, which 
decides for input strings all matching expressions. One possibility is to resort to 
using nondeterministic machines, but then for every input at least $O(nk)$ time is
needed, where $n$ is the number of expressions and $k$ denotes the length of the
input string.

In this paper, we defined a new class of regular expressions with the prop- 
erty that for sets of expressions in this class a deterministic program can be 
constructed in linear time, such that the program classifies input strings in lin- 
ear time with respect to the expressions. 

In further work we plan to define more new classes of regular expressions for 
which the classification problem can be solved efficiently. Specifically, it seems 
that some of the restrictions we have now placed on the expressions can be
considerably relaxed.